FastSpar

Rapid and scalable correlation estimation for compositional data.

Introduction

FastSpar is a C++ implementation of the SparCC algorithm that is up to several thousand times faster than the original Python 2 release and uses much less memory. FastSpar also provides threading support and a p-value estimator that accounts for the possibility of repeated data permutations (see this paper for further details).

Citation

If you use this tool, please cite the FastSpar paper and the original SparCC paper.

Requirements

The pre-compiled static binaries have no requirements on 64-bit Linux distributions. Building and running dynamically linked binaries requires several libraries. For further information, see Compiling from source.

Installation

FastSpar can be installed using conda or from source.

Conda

To install through conda, use:

conda install -c bioconda -c conda-forge fastspar

Compiling from source

Compiling from source requires these libraries and software:

C++11 (gcc-4.9.0+, clang-4.9.0+, etc)
OpenMP 4.0+
Gfortran
Armadillo 6.7+
LAPACK
OpenBLAS
GNU Scientific Library 2.1+
GNU getopt
GNU make
GNU autoconf
GNU autoconf-archive

These dependencies can be installed with the following packages on Ubuntu 20.04:

build-essential
gfortran
dh-autoreconf
libarmadillo-dev
libopenblas-openmp-dev
libgsl-dev

After meeting the above requirements, compiling and installing FastSpar from source can be done by:

git clone https://github.com/scwatts/fastspar.git
cd fastspar
./autogen.sh
./configure --prefix=/usr/
make
make install

Once completed, the FastSpar executables can be run from the command line.

Usage

Correlation inference

To run FastSpar, you must have absolute OTU counts in a BIOM TSV format file (with no metadata). The fake_data.tsv file (from the original SparCC implementation) will be used as an example:

fastspar --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --covariance median_covariance.tsv
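If your counts are not yet in a tab-separated table, the following Python sketch shows one way such a file could be written. The '#OTU ID' header label, the one-row-per-OTU layout, and the sample and OTU names are assumptions for illustration; compare against tests/data/fake_data.tsv before relying on it.

```python
import csv
import io


def write_otu_table(handle, sample_names, counts_by_otu):
    """Write absolute OTU counts as a tab-separated table, one row per OTU."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Assumed header convention: '#OTU ID' followed by the sample names.
    writer.writerow(["#OTU ID"] + list(sample_names))
    for otu_id, counts in counts_by_otu.items():
        if len(counts) != len(sample_names):
            raise ValueError(f"{otu_id}: expected {len(sample_names)} counts")
        # Absolute counts only: reject negative or fractional values.
        if any(c < 0 or int(c) != c for c in counts):
            raise ValueError(f"{otu_id}: counts must be non-negative integers")
        writer.writerow([otu_id] + [int(c) for c in counts])


# Hypothetical samples and OTUs, written to an in-memory buffer.
buffer = io.StringIO()
write_otu_table(
    buffer,
    ["sample_1", "sample_2", "sample_3"],
    {"otu_a": [12, 0, 7], "otu_b": [3, 25, 1]},
)
table_text = buffer.getvalue()
print(table_text)
```

To write to disk instead, pass a file opened with open(path, "w", newline="") in place of the buffer.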

The number of iterations (rounds of SparCC correlation estimation) and exclusion iterations (the number of times highly correlated OTU pairs are discovered and excluded) can also be tweaked:

fastspar --iterations 50 --exclude_iterations 20 --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --covariance median_covariance.tsv

Further, the minimum threshold to exclude correlated OTU pairs can be increased:

fastspar --threshold 0.2 --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --covariance median_covariance.tsv
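To inspect the results, the correlation output can be loaded with a few lines of Python. This is a minimal sketch that assumes the output is a square tab-separated matrix with OTU identifiers as both the header row and the first column; the function names and the example matrix are hypothetical.

```python
import csv
import io


def read_matrix(handle):
    """Read a square tab-separated matrix into a dict of dicts."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    otu_ids = header[1:]  # first header cell is the row-label column
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        matrix[row[0]] = dict(zip(otu_ids, (float(v) for v in row[1:])))
    return matrix


def strongest_pairs(matrix, top=3):
    """Return the OTU pairs with the largest absolute correlation."""
    ids = list(matrix)
    pairs = []
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:  # upper triangle only, no self-pairs
            pairs.append((first, second, matrix[first][second]))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)[:top]


# Made-up matrix standing in for a file such as median_correlation.tsv.
example = (
    "#OTU ID\ta\tb\tc\n"
    "a\t1.0\t0.2\t-0.7\n"
    "b\t0.2\t1.0\t0.1\n"
    "c\t-0.7\t0.1\t1.0\n"
)
top_pair = strongest_pairs(read_matrix(io.StringIO(example)), top=1)
print(top_pair)
```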

Calculation of exact p-values

There are several methods to calculate p-values for inferred correlations. Here we have elected to use a robust permutation-based approach. This process involves inferring correlations from random permutations of the original OTU count data. The magnitude of each p-value is related to how often a more extreme correlation is observed for randomly permuted data. In the example below, we calculate p-values from 1000 bootstrap correlations.

First we generate the 1000 bootstrap counts:

mkdir bootstrap_counts
fastspar_bootstrap --otu_table tests/data/fake_data.tsv --number 1000 --prefix bootstrap_counts/fake_data
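To give a feel for what a bootstrap count table is, here is a rough Python sketch of one resampling scheme: each OTU's counts are redrawn across samples, with replacement, from that OTU's own observed counts, which breaks associations between OTUs. This follows the scheme of the original SparCC implementation as an assumption; fastspar_bootstrap may differ in detail, and the names and counts are hypothetical.

```python
import random


def resample_counts(counts_by_otu, rng):
    """Redraw every OTU's counts independently, with replacement."""
    return {
        otu: rng.choices(counts, k=len(counts))
        for otu, counts in counts_by_otu.items()
    }


# Hypothetical counts: three OTUs observed in five samples.
original = {
    "otu_a": [12, 0, 7, 3, 9],
    "otu_b": [3, 25, 1, 8, 4],
    "otu_c": [40, 38, 51, 44, 29],
}

rng = random.Random(42)  # fixed seed so the sketch is reproducible
replicates = [resample_counts(original, rng) for _ in range(3)]
for replicate in replicates:
    print(replicate)
```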

And then infer correlations for each bootstrap count table (running in parallel on all available processors):

mkdir bootstrap_correlation
parallel fastspar --otu_table {} --correlation bootstrap_correlation/cor_{/} --covariance bootstrap_correlation/cov_{/} -i 5 ::: bootstrap_counts/*
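If the parallel command is not available, the same per-file commands can be generated from a script. The Python sketch below only builds the argument lists, mirroring the flags used above; the helper name and the bootstrap file names are hypothetical.

```python
from pathlib import Path


def build_fastspar_commands(count_files, out_dir, iterations=5):
    """Build one fastspar argument list per bootstrap count file."""
    commands = []
    for count_file in sorted(count_files):
        name = Path(count_file).name  # same role as {/} in the command above
        commands.append([
            "fastspar",
            "--otu_table", str(count_file),
            "--correlation", (Path(out_dir) / f"cor_{name}").as_posix(),
            "--covariance", (Path(out_dir) / f"cov_{name}").as_posix(),
            "-i", str(iterations),  # passed exactly as in the command above
        ])
    return commands


# Hypothetical file names; in practice, list the bootstrap_counts directory.
commands = build_fastspar_commands(
    ["bootstrap_counts/fake_data_1.tsv", "bootstrap_counts/fake_data_0.tsv"],
    "bootstrap_correlation",
)
for command in commands:
    print(" ".join(command))
```

Each list can then be passed to subprocess.run(command, check=True), optionally from a worker pool to run several at once.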

From these correlations, the p-values are then calculated:

fastspar_pvalues --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --prefix bootstrap_correlation/cor_fake_data_ --permutations 1000 --outfile pvalues.tsv
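The idea behind a permutation p-value can be sketched in a few lines of Python. This is a simplified illustration for a single OTU pair, not the estimator fastspar_pvalues uses: it applies the common (b + 1) / (m + 1) pseudo-count form and does not account for repeated permutations. Treating "more extreme" as a two-sided comparison of absolute values is also an assumption.

```python
def permutation_pvalue(observed, permuted):
    """Estimate a two-sided permutation p-value for one OTU pair.

    observed: correlation inferred from the real data
    permuted: correlations for the same pair from permuted data
    """
    m = len(permuted)
    if m == 0:
        raise ValueError("at least one permuted correlation is required")
    # b counts permuted correlations at least as extreme as the observed one.
    b = sum(1 for r in permuted if abs(r) >= abs(observed))
    # The pseudo-count keeps the estimate from ever being exactly zero.
    return (b + 1) / (m + 1)


# Made-up values: one of four permuted correlations is as extreme as 0.8.
p_value = permutation_pvalue(0.8, [0.1, -0.9, 0.3, 0.05])
print(p_value)
```

With 1000 permutations and no permuted correlation as extreme as the observed one, this form gives 1/1001, which is why the number of bootstraps bounds the smallest p-value that can be reported.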

Threading

If FastSpar is compiled with OpenMP, threading can be used by invoking --threads <thread_number> at the command line for several tools:

fastspar --otu_table tests/data/fake_data.tsv --correlation median_correlation.tsv --covariance median_covariance.tsv --iterations 50 --threads 10

Contributors

  • sritchie73
    • Advised on use of permutation based statistical testing
    • Provided an example use of statmod::permp
  • epruesse
    • Created bioconda recipe

License

GNU General Public License v3.0
