Merge pull request #10 from hisplan/dev
v0.2.6
hisplan authored Oct 7, 2020
2 parents c748101 + 93d234e commit 34fd775
Showing 49 changed files with 4,652 additions and 2,192 deletions.
33 changes: 33 additions & 0 deletions .circleci/config.yml
@@ -0,0 +1,33 @@
version: 2.1

orbs:
  python: circleci/[email protected]

jobs:
  build-and-test:
    executor: python/default
    steps:
      - checkout
      - python/load-cache
      - run:
          name: Install cython/numpy/bhtsne
          command: |
            pip install Cython
            pip install numpy
            pip install bhtsne
      - python/install-deps
      - python/save-cache
      - run:
          name: Install seqc
          command: pip install .
      - run:
          name: Test
          command: |
            export TMPDIR="/tmp"
            python -m nose2 -s src/seqc/tests test_run_rmt_correction
workflows:
  main:
    jobs:
      - build-and-test
38 changes: 38 additions & 0 deletions .github/workflows/python-app.yml
@@ -0,0 +1,38 @@
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: Python application

on: [push, pull_request]

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python 3.8
      uses: actions/setup-python@v2
      with:
        python-version: 3.8
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest
        pip install Cython
        pip install numpy
        pip install bhtsne
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Install SEQC
      run: pip install .
    - name: Test with nose2
      run: |
        export TMPDIR="/tmp"
        nose2 -s src/seqc/tests test_run_rmt_correction
3 changes: 3 additions & 0 deletions .gitignore
@@ -10,3 +10,6 @@ dist/*
.project
.pydevproject
.c9/
test-data/
dask-worker-space/

187 changes: 127 additions & 60 deletions README.md
@@ -1,83 +1,150 @@
## SEquence Quality Control (SEQC -- /sek-si:/)
# SEquence Quality Control (SEQC -- /sek-si:/)

## Overview:

SEQC is a python package that processes single-cell sequencing data in the cloud and analyzes it interactively on your local machine.

To faciliate easy installation and use, we have made available Amazon Machine Images (AMIs) that come with all of SEQC's dependencies pre-installed. In addition, we have uploaded common genome indices (`-i/--index parameter`) and barcode data (`--barcode-files`) to public amazon s3 repositories. These links can be provided to SEQC and it will automatically fetch them prior to initiating an analysis run. Finally, it can fetch input data directly from BaseSpace or amazon s3 for analysis.
To facilitate easy installation and use, we have made available Amazon Machine Images (AMIs) that come with all of SEQC's dependencies pre-installed. In addition, we have uploaded common genome indices (`-i/--index` parameter) and barcode data (`--barcode-files`) to public Amazon S3 repositories. These links can be provided to SEQC, and it will automatically fetch them prior to initiating an analysis run. Finally, it can fetch input data directly from BaseSpace or Amazon S3 for analysis.

For users with access to in-house compute clusters, SEQC can be installed on your systems and run using the --local parameter.
For users with access to in-house compute clusters, SEQC can be installed on your systems and run using the `--local` parameter.

### Dependencies:
## Dependencies:

### Python 3

#### Python3
Python must be installed on your local machine to run SEQC. We recommend installing python3 through your unix operating system's package manager. For Mac OSX users we recommend <a href=http://brew.sh/>homebrew</a>. Typical installation commands would be:
Python3 must be installed on your local machine to run SEQC. We recommend installing Python3 through Miniconda (https://docs.conda.io/en/latest/miniconda.html).

brew install python3 # mac
apt-get install python3 # debian
yum install python3 # rpm-based
### Python 3 Libraries

#### Python3 Libraries
We recommend creating a virtual environment before installing anything:

Installing these libraries is necessary before installing SEQC.
```bash
conda create -n seqc python=3.7.7 pip
conda activate seqc
```

pip3 install Cython
pip3 install numpy
pip3 install bhtsne
```bash
pip install Cython
pip install numpy
pip install bhtsne
```

#### STAR
To process data locally using SEQC, you must install the <a href=https://github.com/alexdobin/STAR>STAR Aligner</a>, <a href=http://www.htslib.org/>Samtools</a>, and <a href=https://support.hdfgroup.org/HDF5/>hdf5</a>. If you only intend to use SEQC to trigger remote processing on AWS, these dependencies are optional. We recommend installing samtools and hdf5 through your package manager, if possible.

#### Hardware Requirements:
For processing a single lane (~200M reads) against human- and mouse-scale genomes, SEQC requires 30GB RAM, approximately 200GB free hard drive space, and scales linearly with additional compute cores. If running on AWS (see below), jobs are automatically scaled up or down according to the size of the input. There are no hardware requirements for the computer used to launch remote instances.


#### Amazon Web Services:
SEQC can be run on any unix-based operating system, however it also features the ability to automatically spawn Amazon Web Services instances to process your data. If you wish to take advantage of AWS, you will need to follow their instructions to:

1. <a href=http://aws.amazon.com>Set up an AWS account</a>
2. <a href=https://aws.amazon.com/cli/>Install and configure AWS CLI</a>
3. <a href=http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html>Create and upload an rsa-key for AWS</a>


### SEQC Installation:
### STAR, Samtools, and HDF5

Once all dependencies have been installed, SEQC can be installed on any machine by typing:

$> git clone https://github.com/dpeerlab/seqc.git
$> cd seqc && python3 setup.py install

Please note that to avoid passing the -k/--rsa-key command when you execute SEQC runs, you can also set the environment variable `AWS_RSA_KEY` to the path to your newly created key.

### Testing SEQC:

All the unit tests in class `TestSEQC` in `test.py` have been tested. Currently, only two platforms `ten_x_v2` and `in_drop_v2` have been tested. Old unit tests from these two platforms together with other platforms are stored at `s3://dp-lab-data/seqc-old-unit-test/`.

### Running SEQC:

After SEQC is installed, help can be listed:
To process data locally using SEQC, you must install the <a href=https://github.com/alexdobin/STAR>STAR Aligner</a>, <a href=http://www.htslib.org/>Samtools</a>, and <a href=https://support.hdfgroup.org/HDF5/>hdf5</a>. If you only intend to use SEQC to trigger remote processing on AWS, these dependencies are optional. We recommend installing samtools and hdf5 through your package manager, if possible.

SEQC [-h] [-v] {run,progress,terminate,instances,start,index} ...
## SEQC Installation

Processing Tools for scRNA-seq Experiments
Once all dependencies have been installed, SEQC can be installed by running:

positional arguments:
{run,progress,terminate,instances,start,index}
run initiate SEQC runs
progress check SEQC run progress
terminate terminate SEQC runs
instances list all running instances
start initialize a seqc-ready instance
index create a SEQC index
```bash
export SEQC_VERSION="0.2.6"
wget https://github.com/hisplan/seqc/archive/v${SEQC_VERSION}.tar.gz
tar xvzf v${SEQC_VERSION}.tar.gz
cd seqc-${SEQC_VERSION}
pip install .
```

optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
## Hardware Requirements:

In addition to processing sequencing experiments, SEQC.py provides some convenience tools to create indices for use with SEQC and STAR, and tools to check the progress of remote runs, list current runs, start instances, and terminate them.
For processing a single lane (~200M reads) against human- and mouse-scale genomes, SEQC requires 30GB RAM, approximately 200GB free hard drive space, and scales linearly with additional compute cores. If running on AWS (see below), jobs are automatically scaled up or down according to the size of the input. There are no hardware requirements for the computer used to launch remote instances.
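The figures above can be turned into a quick pre-flight check before starting a local run. This is a minimal sketch and not part of SEQC: the function names are hypothetical, the thresholds are simply the 30GB RAM and 200GB disk values quoted above, and `detect_resources` relies on `os.sysconf`, which is assumed to be available (Linux and other Unix systems).

```python
import os
import shutil

# Thresholds copied from the hardware paragraph; adjust for larger inputs.
MIN_RAM_GB = 30
MIN_DISK_GB = 200


def meets_requirements(ram_gb, free_disk_gb,
                       min_ram_gb=MIN_RAM_GB, min_disk_gb=MIN_DISK_GB):
    """Return True when both RAM and free disk space reach the minimums."""
    return ram_gb >= min_ram_gb and free_disk_gb >= min_disk_gb


def detect_resources(path="."):
    """Best-effort detection of (total RAM, free disk space) in GB.

    Hypothetical helper: uses os.sysconf, so it only works on Unix-like systems.
    """
    free_disk_gb = shutil.disk_usage(path).free / 1e9
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return ram_bytes / 1e9, free_disk_gb
```

Calling `meets_requirements(*detect_resources("."))` in the working directory gives a yes/no answer for a single-lane run; it says nothing about how run time scales with cores.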

To seamlessly start an AWS instance with automatic installation of SEQC from your local machine you can run:
## Running SEQC on Local Machine:

Download an example dataset (1k PBMCs from a healthy donor; freely available at 10x Genomics https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_1k_v3):

```bash
wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_fastqs.tar
tar xvf pbmc_1k_v3_fastqs.tar
```

Move R1 FASTQ files to the `barcode` folder and R2 FASTQ files to the `genomic` folder:

```bash
mkdir barcode
mkdir genomic
mv ./pbmc_1k_v3_fastqs/*R1*.fastq.gz barcode/
mv ./pbmc_1k_v3_fastqs/*R2*.fastq.gz genomic/
```
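The same sorting step can be scripted in Python when it needs to be repeated for many samples. This is a sketch only: `sort_fastqs` is a hypothetical helper, not an SEQC function, and it assumes the `*R1*` / `*R2*` naming used by the example dataset.

```python
import shutil
from pathlib import Path


def sort_fastqs(source_dir, barcode_dir="barcode", genomic_dir="genomic"):
    """Move *R1*.fastq.gz files to barcode_dir and *R2*.fastq.gz files to genomic_dir.

    Returns a dict listing the file names moved into each folder.
    """
    source = Path(source_dir)
    targets = (
        ("*R1*.fastq.gz", "barcode", Path(barcode_dir)),
        ("*R2*.fastq.gz", "genomic", Path(genomic_dir)),
    )
    moved = {"barcode": [], "genomic": []}
    for pattern, key, target in targets:
        target.mkdir(parents=True, exist_ok=True)
        for fastq in sorted(source.glob(pattern)):
            shutil.move(str(fastq), str(target / fastq.name))
            moved[key].append(fastq.name)
    return moved
```

Index reads (`*I1*`) match neither pattern and stay in the source folder, as in the shell version.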

Download the 10x barcode whitelist file:

```bash
mkdir whitelist
wget https://seqc-public.s3.amazonaws.com/barcodes/ten_x_v3/flat/3M-february-2018.txt
mv 3M-february-2018.txt ./whitelist/
```

The resulting directory structure should look something like this:

```
.
├── barcode
│   ├── pbmc_1k_v3_S1_L001_R1_001.fastq.gz
│   └── pbmc_1k_v3_S1_L002_R1_001.fastq.gz
├── genomic
│   ├── pbmc_1k_v3_S1_L001_R2_001.fastq.gz
│   └── pbmc_1k_v3_S1_L002_R2_001.fastq.gz
├── pbmc_1k_v3_fastqs
│   ├── pbmc_1k_v3_S1_L001_I1_001.fastq.gz
│   └── pbmc_1k_v3_S1_L002_I1_001.fastq.gz
├── pbmc_1k_v3_fastqs.tar
└── whitelist
    └── 3M-february-2018.txt
```
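Before building the index or starting a run, it can help to confirm that the layout above is in place. The following is a small sketch under that assumption; `missing_inputs` is a hypothetical name, and the three folder names are the ones created in the previous steps.

```python
from pathlib import Path


def missing_inputs(root=".", required=("barcode", "genomic", "whitelist")):
    """Return the required subfolders that are missing or contain no files."""
    root = Path(root)
    problems = []
    for name in required:
        folder = root / name
        # A folder counts as usable only if it exists and holds at least one file.
        if not folder.is_dir() or not any(p.is_file() for p in folder.iterdir()):
            problems.append(name)
    return problems
```

An empty list means all three input folders exist and are non-empty; it does not validate the contents of the files.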

Create a reference package (STAR index + gene annotation):

```bash
SEQC index \
--organism homo_sapiens \
--ensemble-release 93 \
--valid-biotypes protein_coding lincRNA antisense IG_V_gene IG_D_gene IG_J_gene IG_C_gene TR_V_gene TR_D_gene TR_J_gene TR_C_gene \
--read-length 101 \
--folder index \
--local
```

Run SEQC:

```bash
export AWS_DEFAULT_REGION=us-east-1
export SEQC_MAX_WORKERS=7

SEQC run ten_x_v3 \
--index ./index/ \
--barcode-files ./whitelist/ \
--barcode-fastq ./barcode/ \
--genomic-fastq ./genomic/ \
--upload-prefix ./seqc-results/ \
--output-prefix PBMC \
--no-filter-low-coverage \
--min-poly-t 0 \
--star-args runRNGseed=0 \
--local
```
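When the run is launched from a Python script rather than a shell, the same command can be assembled as an argument list. This is a hedged sketch: the helper names are hypothetical, the flags and the two environment variables are exactly those used in the shell example above, and nothing here is an official SEQC Python API.

```python
import os
import subprocess


def build_seqc_run_command(platform, index, barcode_files, barcode_fastq,
                           genomic_fastq, upload_prefix, output_prefix):
    """Assemble the argument list for a local `SEQC run`, mirroring the shell example."""
    return [
        "SEQC", "run", platform,
        "--index", index,
        "--barcode-files", barcode_files,
        "--barcode-fastq", barcode_fastq,
        "--genomic-fastq", genomic_fastq,
        "--upload-prefix", upload_prefix,
        "--output-prefix", output_prefix,
        "--no-filter-low-coverage",
        "--min-poly-t", "0",
        "--star-args", "runRNGseed=0",
        "--local",
    ]


def run_seqc(command, max_workers=7, region="us-east-1"):
    """Run the command with the environment variables set in the shell example."""
    env = dict(os.environ,
               AWS_DEFAULT_REGION=region,
               SEQC_MAX_WORKERS=str(max_workers))
    # check=True raises CalledProcessError if SEQC exits with a non-zero status.
    return subprocess.run(command, env=env, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues in paths; `run_seqc` requires the `SEQC` executable to be on `PATH`.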

## Running SEQC on Amazon Web Services:

SEQC can be run on any Unix-based operating system; it can also automatically spawn Amazon Web Services instances to process your data. To use AWS, first complete the following steps:

SEQC start
1. <a href=http://aws.amazon.com>Set up an AWS account</a>
2. <a href=https://aws.amazon.com/cli/>Install and configure AWS CLI</a>
3. <a href=http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html>Create and upload an rsa-key for AWS</a>

Run SEQC:

```bash
SEQC run ten_x_v2 \
--ami-id ami-08652ee2477761403 \
--user-tags Job:Test,Project:PBMC-Test,Sample:pbmc_1k_v3 \
--index s3://seqc-public/genomes/hg38_long_polya/ \
--barcode-files s3://seqc-public/barcodes/ten_x_v2/flat/ \
--genomic-fastq s3://.../genomic/ \
--barcode-fastq s3://.../barcode/ \
--upload-prefix s3://.../seqc-results/ \
--output-prefix PBMC \
--no-filter-low-coverage \
--min-poly-t 0 \
--star-args runRNGseed=0
```
44 changes: 44 additions & 0 deletions docs/README.md
@@ -0,0 +1,44 @@
# docs

## Developers

- [Environment setup for development](./install-dev.md)
- [Running test](./run-test.md)


## Generating Reference Packages

This generates a reference package (STAR index and GTF) using SEQC v0.2.6.

- Ensembl 86
- Gene annotation file that contains only the reference chromosomes (no scaffolds, no patches)
- Only these biotypes: 'protein_coding', 'lincRNA', 'IG_V_gene', 'IG_C_gene', 'IG_J_gene', 'TR_C_gene', 'TR_J_gene', 'TR_V_gene', 'TR_D_gene', 'IG_D_gene'
- Not passing anything to `--additional-id-types`
- Setting the read length to 101 (internally, this becomes 100)
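To illustrate what the biotype restriction means in practice, here is a minimal sketch that keeps only GTF records whose `gene_biotype` attribute is in the list above. It is not SEQC's implementation: the function name is hypothetical, and it assumes Ensembl-style GTF attributes of the form `gene_biotype "protein_coding";`.

```python
import re

# Biotypes copied from the list above.
VALID_BIOTYPES = frozenset({
    "protein_coding", "lincRNA",
    "IG_V_gene", "IG_C_gene", "IG_J_gene", "IG_D_gene",
    "TR_C_gene", "TR_J_gene", "TR_V_gene", "TR_D_gene",
})

_BIOTYPE = re.compile(r'gene_biotype "([^"]+)"')


def filter_gtf_lines(lines, valid_biotypes=VALID_BIOTYPES):
    """Keep header lines and records whose gene_biotype is in valid_biotypes."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # preserve GTF header/comment lines
            continue
        match = _BIOTYPE.search(line)
        if match and match.group(1) in valid_biotypes:
            kept.append(line)
    return kept
```

Records without a `gene_biotype` attribute are dropped in this sketch; a real pipeline may prefer to treat them differently.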

### Local

```bash
SEQC index \
-o homo_sapiens \
-f homo_sapiens \
--ensemble-release 93 \
--valid-biotypes protein_coding lincRNA antisense IG_V_gene IG_D_gene IG_J_gene IG_C_gene TR_V_gene TR_D_gene TR_J_gene TR_C_gene \
--read-length 101 \
--folder ./test-data/index/ \
--local
```

### AWS

```bash
SEQC index \
-o homo_sapiens \
-f homo_sapiens \
--ensemble-release 93 \
--valid-biotypes protein_coding lincRNA antisense IG_V_gene IG_D_gene IG_J_gene IG_C_gene TR_V_gene TR_D_gene TR_J_gene TR_C_gene \
--read-length 101 \
--upload-prefix s3://dp-lab-test/seqc/index/86/ \
--rsa-key ~/dpeerlab-chunj.pem \
--ami-id ami-037cc8c1417e197c1
```
50 changes: 50 additions & 0 deletions docs/install-SUSE.md
@@ -0,0 +1,50 @@
# Installation for SUSE

This was tested with AWS SUSE Linux Enterprise Server 15 SP1 (HVM).

## Install gcc & c++

```bash
sudo zypper in gcc-c++
```

## Install Miniconda

```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```

For more information:
- https://docs.conda.io/en/latest/miniconda.html
- https://conda.io/projects/conda/en/latest/user-guide/install/linux.html#install-linux-silent

Log out and log back in.

## Create a Virtual Environment

```bash
conda create -n seqc python=3.7.7 pip
conda activate seqc
```

## Install dependencies

```bash
pip install Cython
pip install numpy
pip install bhtsne
conda install -c anaconda hdf5
conda install -c bioconda samtools
conda install -c bioconda star
```

## Install SEQC

```bash
wget https://github.com/dpeerlab/seqc/archive/v0.2.6.tar.gz
tar xvzf v0.2.6.tar.gz
cd seqc-0.2.6/
pip install .
```