This commit updates SourceTracker2 to significantly reduce memory footprint and improve the candidate API calls. The internal containers for data are now pandas DataFrames, significantly reducing parsing/indexing code. The README.md has been updated to reflect the new API access.
wdwvt1 committed Aug 24, 2016
1 parent a7cbe05 commit 07f67e4
Showing 9 changed files with 1,440 additions and 1,950 deletions.
19 changes: 19 additions & 0 deletions .coveragerc
@@ -0,0 +1,19 @@
# this file is based on the examples provided on scikit-learn's .coveragerc
# and is adapted from scikit-bio's .coveragerc

[run]
omit =
*/tests*
*/__init__.py
source = sourcetracker
branch = True
include = */sourcetracker/*

[report]
exclude_lines =
pragma: no cover
raise NotImplementedError
if __name__ == .__main__.:
omit =
*/tests*
*/__init__.py
1 change: 1 addition & 0 deletions .gitignore
@@ -69,3 +69,4 @@ target/
# example output
data/tiny-test/mixing_proportions
data/tiny-test/source_loo
data/tiny-test/example*
10 changes: 6 additions & 4 deletions .travis.yml
@@ -12,11 +12,13 @@ before_install:
install:
- conda create --yes -n st2 python=$PYTHON_VERSION nose numpy scipy h5py flake8 scikit-bio=0.4.0
- source activate st2
- pip install coveralls
- pip install .
script:
- nosetests
- nosetests --with-coverage
- flake8 sourcetracker setup.py
- sourcetracker2 gibbs --help
- cd data/tiny-test/
- sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example1/
- sourcetracker2 gibbs -i otu_table.biom -m alt-map.txt -o example2/ --source_sink_column source-or-sink --source_column_value src --sink_column_value snk --source_category_column sample-type
- sourcetracker2 gibbs -i data/tiny-test/otu_table.biom -m data/tiny-test/map.txt -o example1/
- sourcetracker2 gibbs -i data/tiny-test/otu_table.biom -m data/tiny-test/alt-map.txt -o example2/ --source_sink_column source-or-sink --source_column_value src --sink_column_value snk --source_category_column sample-type
after_success:
- coveralls
7 changes: 7 additions & 0 deletions ChangeLog.md
@@ -2,6 +2,13 @@

## 2.0.1-dev (changes since 2.0.1 go here)

* A candidate public API has been created for both normal sink/source
prediction and leave-one-out (LOO) classification. These calls are
``_gibbs`` and ``_gibbs_loo``.
* The per-sink feature assignments are recorded for every run and written to
the output directory. They are named as X.contingency.txt where X is the
name of a sink.

## 2.0.1

* Initial alpha release.
113 changes: 98 additions & 15 deletions README.md
@@ -1,20 +1,101 @@
[![Build Status](https://travis-ci.org/biota/sourcetracker2.svg?branch=master)](https://travis-ci.org/biota/sourcetracker2)

# SourceTracker2
[![Build Status](https://travis-ci.org/biota/sourcetracker2.svg?branch=master)](https://travis-ci.org/biota/sourcetracker2) [![Coverage Status](https://coveralls.io/repos/github/biota/sourcetracker2/badge.svg)](https://coveralls.io/github/biota/sourcetracker2)

SourceTracker was originally described in [Knights et al., 2011](http://www.ncbi.nlm.nih.gov/pubmed/21765408).
If you use this package, please cite the original SourceTracker paper linked
above pending publication of SourceTracker 2.

# API vs. CLI

There are two ways to access the SourceTracker 2 functionality: the command
line (CLI) or the Python API. Users seeking to replicate the functionality
of SourceTracker 1 should use the command line functionality (`sourcetracker2 gibbs`).
Programmatic users are encouraged to use the API (exposed via `gibbs` and `gibbs_loo`).

The help documentation is broken down into sections with separate subsections for
API and CLI usage.

# File Formats

## Command Line
For descriptions of all file formats and options, please see the help
documentation, available with the command `sourcetracker2 gibbs --help`.

This script requires a feature X sample contingency table (traditionally an OTU
table) and sample metadata (traditionally a mapping file).

The feature X sample table is an `nXm` table (`n` rows, `m` columns) with feature IDs in the first column and sample IDs in the first row. The values in each 'cell' of the table
must be integer counts.

The sample metadata file is an `sXk` table (`s` rows, `k` columns) with sample IDs in
the first column and metadata headers in the first row. The values in each 'cell' of the
table can be any type of data, and represent information about each sample.

Any feature table that can be read by the `biom-format >= 2.1.4` package will be
acceptable input. For example file formats please look at the test mapping file
and feature (OTU) tables we have included [here](https://github.com/biota/sourcetracker2/tree/master/data/tiny-test).

## API
For descriptions of the requirements, please see the documentation in the `gibbs`
function. Very briefly, this function wraps the main workhorse function `_gibbs_sampler`
and exposes all the parameters necessary to control the behavior of the Gibbs sampling
as well as the parallel functionality.

A superficial but important difference from the CLI framework is that, internally,
SourceTracker 2 represents all tables as sample X feature (samples are rows,
features are columns). This reflects choices in Dan's original code, and also
eases metadata-based subsetting of tables. The API functions expect data in
sample X feature format.
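
As a rough illustration of the orientation change described above, the sketch below transposes a small feature X sample table into sample X feature form. The sample IDs, feature IDs, and counts are made up for illustration, and plain Python lists stand in for the DataFrames the API works with.

```python
# Hypothetical feature X sample table: rows are features, columns are samples.
sample_ids = ["sink1", "source1", "source2"]
feature_ids = ["otu1", "otu2"]
feature_by_sample = [
    [10, 0, 5],  # otu1 counts in sink1, source1, source2
    [3, 7, 0],   # otu2 counts in sink1, source1, source2
]

# Transpose so that each row is a sample and each column is a feature.
sample_by_feature = [list(row) for row in zip(*feature_by_sample)]

for sample_id, counts in zip(sample_ids, sample_by_feature):
    print(sample_id, dict(zip(feature_ids, counts)))
```

With a pandas DataFrame the same reorientation is available through the `DataFrame.T` attribute.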

# Preprocessing

## Command line
Input feature data should be counts. If non-count data (e.g. the count of
feature i in sample j was 4.63) is passed, the 'ceiling' of the data will be
taken. This means that each non-integer count will be rounded up to the nearest
larger integer.

Rarefaction is performed by default at 1000 seqs/sample for both sinks and
sources. This is done to prevent samples with more counts from dominating the
contributions. Rarefaction depth can be set (or entirely disabled) with the ``--source_rarefaction_depth`` and ``--sink_rarefaction_depth`` parameters. Source
samples which are collapsed are rarefied after collapse.
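
The rounding and rarefaction steps described above can be sketched as follows. This is a minimal illustration, not the SourceTracker 2 implementation: the `rarefy` helper, its signature, and its handling of samples with too few counts (returning `None`) are assumptions made for the example.

```python
import math
import random


def rarefy(counts, depth, seed=None):
    """Subsample `counts` (feature -> count) to `depth` without replacement.

    Hypothetical helper. Returns None when the sample holds fewer than
    `depth` counts; how SourceTracker 2 treats such samples is not
    covered here.
    """
    # Round non-integer counts up to the next integer (the 'ceiling').
    int_counts = {f: math.ceil(c) for f, c in counts.items()}
    # Expand to one entry per observed sequence, then draw `depth` of them.
    pool = [f for f, c in int_counts.items() for _ in range(c)]
    if len(pool) < depth:
        return None
    rng = random.Random(seed)
    drawn = rng.sample(pool, depth)
    rarefied = {f: 0 for f in int_counts}
    for f in drawn:
        rarefied[f] += 1
    return rarefied


sample = {"otu1": 4.63, "otu2": 12, "otu3": 0}
print(rarefy(sample, depth=10, seed=0))
```

After rarefaction every retained sample has the same total count, so no sample contributes more sequences than another.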

Samples which are not present in both the input feature table and the metadata
are excluded from the analysis.

Samples which come from the same source environment are 'collapsed', meaning
their mean value for each feature is computed and used in the analysis. See the
'Theory' section below for a discussion of this approach.
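
A minimal sketch of collapsing, assuming samples are stored as equal-length count lists keyed by sample ID. The `collapse_sources` helper, the sample IDs, and the environment names are hypothetical and only illustrate the per-feature mean described above.

```python
from collections import defaultdict


def collapse_sources(samples, env_of):
    """Average feature counts over all samples sharing an environment."""
    grouped = defaultdict(list)
    for sample_id, counts in samples.items():
        grouped[env_of[sample_id]].append(counts)
    collapsed = {}
    for env, rows in grouped.items():
        # Column-wise mean across the samples of this environment.
        collapsed[env] = [sum(col) / len(rows) for col in zip(*rows)]
    return collapsed


samples = {"s1": [10, 0, 2], "s2": [6, 4, 0], "s3": [1, 1, 1]}
env_of = {"s1": "soil", "s2": "soil", "s3": "skin"}
print(collapse_sources(samples, env_of))
```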

## API
The `gibbs` and `gibbs_loo` functions do minimal preprocessing on the input data.
The source and sink dataframes are treated as final (no collapsing is done on them),
i.e. each sink is treated independently. The data is **not** rarefied; the tables
are expected to have the desired row sums.


# Output

## Command line
There are two default output files, `mixing_proportions.txt` and
`mixing_proportions_stds.txt`. `mixing_proportions.txt` is a tab-separated contingency table with sinks as rows and sources as columns. The values in the table are the
mean fractional contributions of each source to each sink. `mixing_proportions_stds.txt`
has the same format, but contains the standard deviation of each fractional contribution.
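
As a sketch of how the mixing proportions table might be consumed, the example below parses a tab-separated table with sinks as rows and sources as columns and reports the largest contributor to each sink. The table contents, the `SampleID` header label, and the source names are made up for illustration; a real file would be opened from the output directory instead of read from a string.

```python
import csv
import io

# Made-up contents in the layout described above:
# sinks as rows, sources as columns, tab-separated.
text = (
    "SampleID\tsoil\tskin\tUnknown\n"
    "sink1\t0.70\t0.20\t0.10\n"
    "sink2\t0.05\t0.90\t0.05\n"
)

reader = csv.reader(io.StringIO(text), delimiter="\t")
header = next(reader)
sources = header[1:]

# Map each sink to a {source: proportion} dictionary.
proportions = {
    row[0]: dict(zip(sources, map(float, row[1:]))) for row in reader
}

for sink, contrib in proportions.items():
    top = max(contrib, key=contrib.get)
    print(sink, top, contrib[top])
```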

Optionally, you can create per-sink feature X sample tables with the `--per_sink_feature_assignments` flag. The per-sink feature tables are labeled with
the name of the sink. For example, if we have a sink called 'hand_sample3' the
output feature table would be 'hand_sample3.feature_table.txt'. These tables record the
origin source of each sink sequence (count of a feature).

## API
The outputs of the `gibbs` and `gibbs_loo` functions are identical to the command line
outputs, just in dataframe form.


# Documentation

This script replicates and extends the functionality of Dan Knights's
SourceTracker R package.

The `mapping file` which describes the `sources` and `sinks` must be
formatted in the same way it was for the SourceTracker R package. Specifically,
there must be a column `SourceSink` and a column `Env`. For an example, look
at `sourcetracker2/data/tiny-test/`.
SourceTracker R package.

A major improvement in this version of SourceTracker is the ability to run it in parallel.
Currently, parallelization across a single machine is
@@ -35,7 +116,7 @@ SourceTracker2 is Python 3 software. The easiest way to install it is using Anac
To install SourceTracker 2 using Anaconda, run the following commands:

```bash
conda create -n st2 python=3.5 numpy scipy h5py hdf5 matplotlib
conda create -n st2 python=3.5 numpy scipy scikit-bio=0.4.3 biom-format h5py hdf5
source activate st2
pip install sourcetracker
```
@@ -141,7 +222,7 @@ These usage examples expect that you are in the directory
`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example1/`

**Calculate the proportion of each source in each sink using an alternate sample metadata mapping file where samples are described differently.**
`sourcetracker2 gibbs -i otu_table.biom -m alt-map.txt -o example2/ --source_sink_column source-or-sink --source_column_value src --sink_column_value snk --source_category_column sample-type`
`sourcetracker2 gibbs -i otu_table.biom -m alt-map.txt -o example2/ --source_sink_column source-or-sink --source_column_value src --sink_column_value sink --source_category_column sample-type`

**Calculate the class label (i.e. 'Env') of each source using a leave one out
strategy**
@@ -151,19 +232,21 @@ strategy**
`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example4/ --burnin 100`

**Calculate the proportion of each source in each sink, using a sink
rarefaction depth of 1500**
`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example5/ --sink_rarefaction_depth 2500`
rarefaction depth of 1700 and source rarefaction depth of 2000**
`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example5/ --sink_rarefaction_depth 1700 --source_rarefaction_depth 2000`

**Calculate the proportion of each source in each sink, using ipyparallel to run in parallel with 5 jobs**
`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example6/ --jobs 5`

**Calculate the proportion of each source in each sink, using ipyparallel to run in parallel with 5 jobs. Write the per-sink feature tables (what SourceTracker 1 called
'full output'). These are feature by sample tables indicating the origin source of each sequence (each count of a feature).**
`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example7/ --jobs 5 --per_sink_feature_assignments`


# Miscellaneous

The current implementation of SourceTracker 2 does not contain functionality for
visualization of results or auto-tuning of the parameters (`alpha1`, `alpha2`,
etc.). For an example of how you might visualize the data, please see
this [Jupyter notebook](https://github.com/biota/SourceTracker2/blob/master/ipynb/Visualizing%20results.ipynb).
For auto-tuning functionality, please see the original R code.

Like the old SourceTracker, SourceTracker2 rarefies the source environments it
collapses by default.