Major overhaul of internals #51
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
# this file is based on the examples provided on scikit-learn's .coveragerc | ||
# and is adapted from scikit-bio's .coveragerc | ||
|
||
[run] | ||
omit = | ||
*/tests* | ||
*/__init__.py | ||
source = sourcetracker | ||
branch = True | ||
include = */sourcetracker/* | ||
|
||
[report] | ||
exclude_lines = | ||
pragma: no cover | ||
raise NotImplementedError | ||
if __name__ == .__main__.: | ||
omit = | ||
*/tests* | ||
*/__init__.py |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -69,3 +69,4 @@ target/ | |
# example output | ||
data/tiny-test/mixing_proportions | ||
data/tiny-test/source_loo | ||
data/tiny-test/example* |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,20 +1,101 @@ | ||
[![Build Status](https://travis-ci.org/biota/sourcetracker2.svg?branch=master)](https://travis-ci.org/biota/sourcetracker2) | ||
|
||
# SourceTracker2 | ||
[![Build Status](https://travis-ci.org/biota/sourcetracker2.svg?branch=master)](https://travis-ci.org/biota/sourcetracker2) [![Coverage Status](https://coveralls.io/repos/github/biota/sourcetracker2/badge.svg)](https://coveralls.io/github/biota/sourcetracker2) | ||
|
||
SourceTracker was originally described in [Knights et al., 2011](http://www.ncbi.nlm.nih.gov/pubmed/21765408). | ||
If you use this package, please cite the original SourceTracker paper linked | ||
above pending publication of SourceTracker 2. | ||
|
||
# API vs. CLI | ||
|
||
There are two ways to access the SourceTracker 2 functionality: via the command | ||
line (CLI) or the Python API. Users seeking to replicate the functionality | ||
of SourceTracker 1 should use the command line functionality (`sourcetracker2 gibbs`). | ||
Programmatic users are encouraged to use the API (exposed via `gibbs` and `gibbs_loo`). | ||
|
||
The help documentation is broken down into sections with separate subsections for | ||
API and CLI usage. | ||
|
||
# File Formats | ||
|
||
## Command Line | ||
For descriptions of all file formats and options, please see the help | ||
documentation, available with the command `sourcetracker2 gibbs --help`. | ||
|
||
This script requires a feature X sample contingency table (traditionally an OTU | ||
table) and sample metadata (traditionally a mapping file). | ||
|
||
The feature X sample table is an `nXm` table (`n` rows, `m` columns) with feature IDs in the first column, and sample IDs in the first row. The values in each 'cell' of the table | ||
must be integer counts. | ||
|
||
The sample metadata file is an `sXk` table (`s` rows, `k` columns) with sample IDs in | ||
the first column, and metadata headers in the first row. The values in each 'cell' of the | ||
table can be any type of data, and represent information about each sample. | ||
|
||
Any feature table that can be read by the `biom-format >= 2.1.4` package will be | ||
acceptable input. For example file formats please look at the test mapping file | ||
and feature (OTU) tables we have included [here](https://github.com/biota/sourcetracker2/tree/master/data/tiny-test). | ||
|
||
## API | ||
For descriptions of the requirements, please see documentation in the `gibbs` | ||
function. Very briefly, this function wraps the main workhorse function `_gibbs_sampler` | ||
and exposes all the parameters necessary to control the behavior of the Gibbs sampling, | ||
as well as the parallel functionality. | ||
|
||
A superficial but important difference from the CLI framework is that, internally, | ||
SourceTracker 2 represents all tables as sample X feature (samples are rows, | ||
features are columns). This reflects choices in Dan's original code and | ||
eases metadata-based subsetting of tables. The API functions expect data in | ||
sample X feature format. | ||
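
As a minimal sketch of the orientation difference, assuming the table is already loaded as a pandas DataFrame (the table contents and IDs below are made up for illustration), a feature X sample table can be transposed before it is handed to the API:

```python
import pandas as pd

# Hypothetical feature X sample table, oriented as the CLI reads it
# (features are rows, samples are columns).
feature_by_sample = pd.DataFrame(
    [[10, 0, 3],
     [5, 7, 0]],
    index=["otu1", "otu2"],
    columns=["s1", "s2", "s3"],
)

# The API functions expect samples as rows and features as columns.
sample_by_feature = feature_by_sample.T
```
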
|
||
# Preprocessing | ||
|
||
## Command line | ||
Input feature data should be counts. If non-count data (e.g. the count of | ||
feature i in sample j was 4.63) is passed, the 'ceiling' of the data will be | ||
taken. This means that each non-integer count will be rounded up to the nearest | ||
larger integer. | ||
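
The rounding described above can be illustrated with NumPy. This is only a sketch of the 'ceiling' behavior on made-up values, not the package's own code:

```python
import numpy as np

# Made-up non-integer "counts"; 4.63 is the value used in the text above.
counts = np.array([[4.63, 2.0],
                   [0.0, 0.2]])

# Round every value up to the nearest integer, then store as integers.
int_counts = np.ceil(counts).astype(int)
```

Values that are already whole numbers (including zero) are unchanged, while any positive fraction becomes the next integer up.
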
"If the user wishes to take the
I don't know that the statement by @lkursell is necessary, but rather that it's implied here (since it says that only counts are accepted). Another way to go would be to fail if something other than integer counts are provided. I think that might be preferable, since they're really doing something wrong if a floating point value is encountered. numpy should fail if you try to cast an array to int when there are values that cannot be safely converted to ints.
||
|
||
Rarefaction is performed by default at 1000 seqs/sample for both sinks and | ||
sources. This is done to prevent samples with more counts from dominating the | ||
contributions. Rarefaction depth can be set (or entirely disabled) with the ``--source_rarefaction_depth`` and ``--sink_rarefaction_depth`` parameters. Source | ||
samples which are collapsed are rarefied after collapse. | ||
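
To make the idea of rarefaction concrete, here is a minimal sketch that subsamples one sample's counts to a fixed depth without replacement. The helper name `rarefy` is hypothetical, and whether SourceTracker 2 subsamples in exactly this way is an assumption, not something stated here:

```python
import numpy as np


def rarefy(counts, depth, seed=None):
    """Subsample a vector of integer counts to `depth` without replacement."""
    counts = np.asarray(counts, dtype=int)
    if counts.sum() < depth:
        raise ValueError("sample has fewer counts than the requested depth")
    rng = np.random.default_rng(seed)
    # Expand the counts into one feature index per observed sequence.
    pool = np.repeat(np.arange(counts.size), counts)
    # Draw `depth` sequences and tally them back into per-feature counts.
    chosen = rng.choice(pool, size=depth, replace=False)
    return np.bincount(chosen, minlength=counts.size)


# A made-up sample with 2000 sequences, rarefied to 1000 seqs/sample.
rarefied = rarefy([1500, 300, 200], depth=1000, seed=0)
```

After rarefaction every sample has the same total, so no sample contributes more simply because it was sequenced more deeply.
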
|
||
Samples which are not present in both the input feature table and the metadata | ||
are excluded from the analysis. | ||
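
A sketch of this filtering step with pandas, assuming both the table (sample X feature) and the metadata are DataFrames indexed by sample ID. The sample IDs and values are made up:

```python
import pandas as pd

# Hypothetical sample X feature table and sample metadata.
table = pd.DataFrame(
    [[1, 2], [3, 4], [5, 6]],
    index=["s1", "s2", "s3"],
    columns=["otu1", "otu2"],
)
metadata = pd.DataFrame({"Env": ["soil", "gut", "skin"]},
                        index=["s2", "s3", "s4"])

# Keep only the samples present in both inputs.
shared = table.index.intersection(metadata.index)
table = table.loc[shared]
metadata = metadata.loc[shared]
```
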
|
||
Samples which come from the same source environment are 'collapsed', meaning | ||
their mean value for each feature is computed and used in the analysis. See the | ||
'Theory' section below for a discussion of this approach. | ||
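
Collapsing can be sketched as a group-wise mean over source samples. The example below assumes a sample X feature DataFrame and an `Env` label per sample; all values are made up:

```python
import pandas as pd

# Hypothetical sample X feature counts for three source samples.
sources = pd.DataFrame(
    [[10, 0], [20, 4], [0, 8]],
    index=["s1", "s2", "s3"],
    columns=["otu1", "otu2"],
)
# Hypothetical metadata: the source environment of each sample.
env = pd.Series(["soil", "soil", "gut"], index=["s1", "s2", "s3"], name="Env")

# One row per environment, holding the mean count of each feature.
collapsed = sources.groupby(env).mean()
```

Here the two `soil` samples collapse into a single row with a mean of 15 for `otu1` and 2 for `otu2`, while `gut` keeps its single sample's values.
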
|
||
## API | ||
The `gibbs` and `gibbs_loo` functions do minimal preprocessing on the input data. | ||
The source and sink dataframes are treated as final (no collapsing is done on them), | ||
i.e. each sink is treated independently. The data is **not** rarefied; the tables | ||
are expected to have the desired row sums. | ||
|
||
|
||
# Output | ||
|
||
## Command line | ||
There are two default output files, `mixing_proportions.txt` and | ||
`mixing_proportions_stds.txt`. `mixing_proportions.txt` is a tab-separated contingency table with sinks as rows and sources as columns. The values in the table are the | ||
mean fractional contributions of each source to each sink. `mixing_proportions_stds.txt` | ||
has the same format, but contains the standard deviation of each fractional contribution. | ||
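
As a sketch of how such a tab-separated table could be read back for downstream work, the example below parses a made-up stand-in for the file contents with pandas. The header label, sink names, source names, and values are all assumptions for illustration:

```python
import io

import pandas as pd

# Made-up stand-in for a mixing proportions file:
# sinks as rows, sources as columns, tab-separated.
text = "SampleID\tgut\tsoil\nsink1\t0.75\t0.25\nsink2\t0.1\t0.9\n"

mixing = pd.read_csv(io.StringIO(text), sep="\t", index_col=0)

# The source with the largest mean contribution to each sink.
top_source = mixing.idxmax(axis=1)
```

With a real run, `io.StringIO(text)` would be replaced by the path to the output file.
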
|
||
Optionally, you can create per-sink feature X sample tables with the `--per_sink_feature_assignments` flag. The per-sink feature tables are labeled with | ||
the name of the sink. For example, if we have a sink called 'hand_sample3' the | ||
output feature table would be 'hand_sample3.feature_table.txt'. These tables record the | ||
origin source of each sink sequence (count of a feature). | ||
|
||
## API | ||
The outputs of the `gibbs` and `gibbs_loo` functions are identical to the command line | ||
outputs, just in dataframe form. | ||
|
||
|
||
# Documentation | ||
|
||
This script replicates and extends the functionality of Dan Knights's | ||
SourceTracker R package. | ||
|
||
The `mapping file` which describes the `sources` and `sinks` must be | ||
formatted in the same way it was for the SourceTracker R package. Specifically, | ||
there must be a column `SourceSink` and a column `Env`. For an example, look | ||
at `sourcetracker2/data/tiny-test/`. | ||
SourceTracker R package. | ||
|
||
A major improvement in this version of SourceTracker is the ability to run it in parallel. | ||
Currently, parallelization across a single machine is | ||
|
@@ -35,7 +116,7 @@ SourceTracker2 is Python 3 software. The easiest way to install it is using Anac | |
To install SourceTracker 2 using Anaconda, run the following commands: | ||
|
||
```bash | ||
conda create -n st2 python=3.5 numpy scipy h5py hdf5 matplotlib | ||
conda create -n st2 python=3.5 numpy scipy scikit-bio=0.4.3 biom-format h5py hdf5 | ||
source activate st2 | ||
pip install sourcetracker | ||
``` | ||
|
@@ -141,7 +222,7 @@ These usage examples expect that you are in the directory | |
`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example1/` | ||
|
||
**Calculate the proportion of each source in each sink using an alternate sample metadata mapping file where samples are described differently.** | ||
`sourcetracker2 gibbs -i otu_table.biom -m alt-map.txt -o example2/ --source_sink_column source-or-sink --source_column_value src --sink_column_value snk --source_category_column sample-type` | ||
`sourcetracker2 gibbs -i otu_table.biom -m alt-map.txt -o example2/ --source_sink_column source-or-sink --source_column_value src --sink_column_value sink --source_category_column sample-type` | ||
|
||
**Calculate the class label (i.e. 'Env') of each source using a leave one out | ||
strategy** | ||
|
@@ -151,19 +232,21 @@ strategy** | |
`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example4/ --burnin 100` | ||
|
||
**Calculate the proportion of each source in each sink, using a sink | ||
rarefaction depth of 1500** | ||
`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example5/ --sink_rarefaction_depth 2500` | ||
rarefaction depth of 1700 and source rarefaction depth of 2000** | ||
`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example5/ --sink_rarefaction_depth 1700 --source_rarefaction_depth 2000` | ||
|
||
**Calculate the proportion of each source in each sink, using ipyparallel to run in parallel with 5 jobs** | ||
`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example6/ --jobs 5` | ||
|
||
**Calculate the proportion of each source in each sink, using ipyparallel to run in parallel with 5 jobs. Write the per-sink feature tables (what SourceTracker 1 called | ||
'full output'). These are feature-by-sample tables indicating the origin source of each sequence (each count of a feature).** | ||
`sourcetracker2 gibbs -i otu_table.biom -m map.txt -o example7/ --jobs 5 --per_sink_feature_assignments` | ||
|
||
|
||
# Miscellaneous | ||
|
||
The current implementation of SourceTracker 2 does not contain functionality for | ||
visualization of results or auto-tuning of the parameters (`alpha1`, `alpha2`, | ||
etc.). For an example of how you might visualize the data, please see | ||
this [Jupyter notebook](https://github.com/biota/SourceTracker2/blob/master/ipynb/Visualizing%20results.ipynb). | ||
For auto-tuning functionality, please see the original R code. | ||
|
||
Like the old SourceTracker, SourceTracker2 rarefies the source environments it | ||
collapses by default. |
`X.full_results.txt` would match Dan's original nomenclature from his most recent release, and is a bit more descriptive.

I agree that this would be better.

I am happy to change this, but let me give my rationale for this. `X.full_results.txt` doesn't suggest what the object is. The full output is a contingency table of sources X features.

That's true. We've been moving away from contingency table toward feature table. What if it was `X.table.txt`? If you do this, you should indicate that these are the same as the `X.full_results.txt` files from SourceTracker 1.