AsaruSim is an automated Nextflow workflow for simulating 10x single-cell long-read data from the count matrix level to the sequence level. It aims to create a gold-standard dataset for the assessment and optimization of single-cell long-read methods. Full documentation is available here.
This pipeline is powered by the Nextflow workflow manager. All dependencies are automatically managed by Nextflow through a preconfigured Docker container, ensuring a seamless and reproducible installation process.
Before starting, ensure the following tools are installed and properly set up on your system:
- Nextflow >= v24.04.4: A workflow engine for complex data pipelines. Installation guide for Nextflow.
- Docker or Singularity: Containers for packaging necessary software, ensuring reproducibility. Docker installation guide, Singularity installation guide.
- Git: Required to clone the workflow repository. Git installation guide.
Clone the AsaruSim GitHub repository:
git clone https://github.com/alihamraoui/AsaruSim.git
cd AsaruSim
To test your installation, we provide an automated script, run_test.sh, that downloads reference annotations and simulates a subset of a human PBMC dataset.
bash run_test.sh
Customize runs by editing the nextflow.config file and/or specifying parameters at the command line.
Here are the primary input parameters for configuring the workflow:
Parameter | Description | Format | Default Value |
---|---|---|---|
matrix | Path to the count matrix .csv file (required) | CSV | test_data/matrix.csv |
transcriptome | Path to the reference transcriptome file (required) | FASTA | test_data/transcriptome.fa |
bc_counts | Path to the barcode count file (if no matrix is provided) | CSV | test_data/test_bc.csv |
Parameter | Description | Format | Default Value |
---|---|---|---|
features | Type of feature identifier used in the count matrix (e.g. transcript_id or gene_name) | STR | transcript_id |
cell_types_annotation | Path to the cell type annotation .csv file | CSV | null |
gtf | Path to the transcriptome annotation .gtf file | GTF | null |
umi_duplication | UMI duplication | INT | 0 |
intron_retention | Simulate the intron retention process | BOOL | false |
ir_model | Intron retention Markov chain model .csv file | CSV | bin/models/SC3pv3_GEX_Human_IR_markov_model |
unspliced_ratio | Percentage of transcripts to be unspliced | FLOAT | 0.0 |
ref_genome | Reference genome .fasta file (if intron retention is enabled) | FASTA | null |
full_length | Indicates whether transcripts are full length | BOOL | false |
truncation_model | Path to the truncation probabilities .csv file | CSV | bin/models/truncation_default_model.csv |
Parameter | Description | Format | Default Value |
---|---|---|---|
pcr_cycles | Number of PCR amplification cycles | INT | 0 |
pcr_error_rate | PCR error rate | FLOAT | "0.0000001" |
pcr_dup_rate | PCR duplication rate | FLOAT | 0.7 |
pcr_total_reads | Total number of reads after PCR | INT | 1000000 |
Configuration for error model:
Parameter | Description | Format | Default Value |
---|---|---|---|
trained_model | Badread pre-trained error/Q-score model name | STR | nanopore2023 |
badread_identity | Comma-separated values for Badread identity parameters | STR | "98,2,99" |
error_model | Custom error model file (optional) | TXT | null |
qscore_model | Custom Q-score model file (optional) | TXT | null |
build_model | Build your own error/Q-score model | STR | false |
fastq_model | Reference real reads (.fastq) used to train the error model (optional) | FASTQ | false |
Parameter | Description | Format | Default Value |
---|---|---|---|
amp | Amplification factor | INT | 1 |
outdir | Output directory for results | PATH | "results" |
projectName | Name of the project | STR | "test_project" |
Configuration for running the workflow:
Parameter | Description | Format | Default Value |
---|---|---|---|
threads | Number of threads to use | INT | 4 |
container | Docker container for the workflow | STR | 'hamraouii/asarusim:0.1' |
docker.runOptions | Docker run options to use | STR | '-u $(id -u):$(id -g)' |
For more details about workflow options see the Input parameters section in the documentation.
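As a minimal sketch of the configuration-file route, the input parameters listed above can be gathered in a separate Nextflow config file and passed with Nextflow's `-c` option. The file name `custom.config`, the output directory `results_custom` and the chosen values are illustrative assumptions; the parameter names come from the tables above, and the sketch assumes they live in the `params` scope, as command-line `--name` options do in Nextflow.

```shell
# Write an illustrative override file (file name and values are assumptions).
cat > custom.config <<'EOF'
params {
    matrix        = 'dataset/sub_pbmc_matrice.csv'
    transcriptome = 'dataset/Homo_sapiens.GRCh38.cdna.all.fa'
    features      = 'gene_name'
    gtf           = 'dataset/GRCh38-2020-A-genes.gtf'
    pcr_cycles    = 2
    outdir        = 'results_custom'
}
EOF

# Only print the command, so the snippet can be tried without launching the pipeline.
echo "nextflow run main.nf -c custom.config"
```

Values given on the command line take precedence over those in a config file, so a single option such as `--pcr_cycles` can still be overridden per run.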
To simulate specific UMI counts per cell barcode with random transcripts, set the --bc_counts parameter to the path of a UMI counts .CSV file. This parameter eliminates the need for an input matrix, enabling the simulation of UMI counts where transcripts are chosen randomly.
Example of a UMI counts per cell barcode (CB) file:
CB | counts |
---|---|
ACGGCGATCGCGAGCC | 1260 |
ACGGCGATCGCGAGCC | 1104 |
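The following sketch writes such a file and runs a quick sanity check on it before it is handed to the workflow. It assumes a comma-separated file with a `CB,counts` header, as suggested by the table above; the file name `bc_counts.csv`, the second barcode, and the rule that each barcode appears only once are assumptions made for illustration.

```shell
# Hypothetical barcodes and counts; header names follow the table above.
cat > bc_counts.csv <<'EOF'
CB,counts
ACGGCGATCGCGAGCC,1260
TTGCATGCCGATAGGA,1104
EOF

# Sanity checks: each barcode appears once, each count is a positive integer.
awk -F',' 'NR > 1 {
    if (seen[$1]++)                  { print "duplicate barcode: " $1; bad = 1 }
    if ($2 !~ /^[0-9]+$/ || $2 == 0) { print "invalid count for " $1 ": " $2; bad = 1 }
} END { exit bad + 0 }' bc_counts.csv && echo "bc_counts.csv looks consistent"

# The file could then be passed to the workflow with: --bc_counts bc_counts.csv
```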
AsaruSim allows users to estimate the characteristics of each cell type from an existing count table. To do so, set the --sim_celltypes parameter to true and provide the list of cell barcodes in each group (.CSV file) using the --cell_types_annotation parameter:
CB | cell_type |
---|---|
ACGGCGATCGCGAGCC | type 1 |
ACGGCGATCGCGAGCC | type 2 |
AsaruSim will then use the provided matrix to estimate the characteristics of each cell group and generate a synthetic count matrix.
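Before estimating per-group characteristics, it can help to see how many barcodes each group contains, since very small groups give little data to estimate from. The sketch below tallies barcodes per cell type; it assumes a comma-separated annotation file with a `CB,cell_type` header as in the table above, and the file names and barcodes are hypothetical.

```shell
# Hypothetical annotation file; column names follow the table above.
cat > cell_types.csv <<'EOF'
CB,cell_type
ACGGCGATCGCGAGCC,type 1
TTGCATGCCGATAGGA,type 2
GGATCCGATTACGCAA,type 1
EOF

# Count barcodes per cell type and print the result as "cell_type,count".
awk -F',' 'NR > 1 { n[$2]++ } END { for (t in n) print t "," n[t] }' cell_types.csv \
    | sort | tee cell_type_sizes.csv
```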
Users can choose among four ways to simulate template reads:
- use a real count matrix
- estimate the parameters from a real count matrix to simulate a synthetic count matrix
- specify the input parameters themselves
- a combination of the above options
We use SPARSim to simulate the count matrix. For more information about synthetic count matrices, please read the SPARSim documentation.
A demonstration dataset to initiate this workflow is accessible on Zenodo (DOI: 10.5281/zenodo.12731408). This dataset is a subsample from a Nanopore run of the 10x 5k human PBMCs.
The human GRCh38 reference transcriptome, GTF annotation and FASTA reference genome can be downloaded from Ensembl.
You can use the run_test.sh script to automatically download all required datasets.
nextflow run main.nf --matrix dataset/sub_pbmc_matrice.csv \
--transcriptome dataset/Homo_sapiens.GRCh38.cdna.all.fa \
--features gene_name \
--gtf dataset/GRCh38-2020-A-genes.gtf
nextflow run main.nf --matrix dataset/sub_pbmc_matrice.csv \
--transcriptome dataset/Homo_sapiens.GRCh38.cdna.all.fa \
--features gene_name \
--gtf dataset/GRCh38-2020-A-genes.gtf \
--pcr_cycles 2 \
--pcr_dup_rate 0.7 \
--pcr_error_rate 0.00003
nextflow run main.nf --matrix dataset/sub_pbmc_matrice.csv \
--transcriptome dataset/Homo_sapiens.GRCh38.cdna.all.fa \
--features gene_name \
--gtf dataset/GRCh38-2020-A-genes.gtf \
--sim_celltypes true \
--cell_types_annotation dataset/sub_pbmc_cell_type.csv
nextflow run main.nf --matrix Chu_param_preset \
--transcriptome datasets/Homo_sapiens.GRCh38.cdna.all.fa \
--features gene_name \
--gtf datasets/Homo_sapiens.GRCh38.112.gtf
nextflow run main.nf --matrix dataset/sub_pbmc_matrice.csv \
--transcriptome dataset/Homo_sapiens.GRCh38.cdna.all.fa \
--features gene_name \
--gtf dataset/GRCh38-2020-A-genes.gtf \
--build_model true \
--fastq_model dataset/sub_pbmc_reads.fq \
--ref_genome dataset/GRCh38-2020-A-genome.fa
nextflow run main.nf --matrix dataset/sub_pbmc_matrice.csv \
--transcriptome dataset/Homo_sapiens.GRCh38.cdna.all.fa \
--features gene_name \
--gtf dataset/GRCh38-2020-A-genes.gtf \
--sim_celltypes true \
--cell_types_annotation dataset/sub_pbmc_cell_type.csv \
--build_model true \
--fastq_model dataset/sub_pbmc_reads.fq \
--ref_genome dataset/GRCh38-2020-A-genome.fa \
--pcr_cycles 2 \
--pcr_dup_rate 0.7 \
--pcr_error_rate 0.00003
After execution, results will be available in the specified --outdir. This includes the simulated Nanopore reads (simulated.fastq.gz), along with a log file and a QC report.
QC_report.html # final QC report
pipeline_info # Pipeline execution trace, timeline and DAG
simulated.fastq.gz # Simulated reads including sequencing errors
template.fa.gz # Simulated template
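A quick way to inspect the simulated reads is to count the records and average their lengths. The helper functions below are a sketch, not part of AsaruSim; they assume a standard four-line-per-record FASTQ and the default `results` output directory.

```shell
# Count reads in a gzipped FASTQ file, assuming four lines per record.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Mean read length: average the sequence line (second line) of each record.
mean_read_length() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { total += length($0); n++ }
                         END { if (n) printf "%.1f\n", total / n; else print 0 }'
}

# Report on the pipeline output if it exists (path assumes the default --outdir).
fastq="results/simulated.fastq.gz"
if [ -f "$fastq" ]; then
    echo "reads: $(count_reads "$fastq")"
    echo "mean length: $(mean_read_length "$fastq")"
fi
```

The QC_report.html file remains the reference for read statistics; these helpers are only meant for a fast command-line check.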
To clean up temporary files generated by Nextflow:
nextflow clean -f
- We would like to express our gratitude to Youyupei for the development of SLSim, which has been helpful to the AsaruSim workflow.
- Additionally, our thanks go to the teams behind Badread, SPARSim and Trans-NanoSim, whose tools are integral to the AsaruSim workflow.
For support, please open an issue in the repository's "Issues" section. Contributions via Pull Requests are welcome. Follow the contribution guidelines specified in CONTRIBUTING.md.
AsaruSim is distributed under a specific license. Check the LICENSE file in the GitHub repository for details.
If you use AsaruSim in your research, please cite this manuscript:
Ali Hamraoui, Laurent Jourdren and Morgane Thomas-Chollier. AsaruSim: a single-cell and spatial RNA-Seq Nanopore long-reads simulation workflow. bioRxiv 2024.09.20.613625; doi: https://doi.org/10.1101/2024.09.20.613625