DE-kupl is a pipeline that finds differentially expressed k-mers between RNA-Seq datasets. It is distributed under the MIT License.
Dekupl-run handles the first part of the DE-kupl pipeline from raw FASTQ to the production of contigs from differentially expressed k-mers.
Dekupl-run is a pipeline built with Snakemake. It works with a configuration file that you will use to set the list of samples and their conditions as well as parameters for the test.
- Create a config.json with the list of your samples, their conditions and the location of their FASTQ files. See the next section for a description of the parameters.
- Run the pipeline. Replace CONFIG_JSON with the config file you have created, NB_THREADS with the number of threads and MAX_MEMORY with the maximum memory (in megabytes) you want DE-kupl to allocate. This command line can vary depending on the installation (Docker, Singularity, manual, etc.).
dekupl-run --configfile CONFIG_JSON -jNB_THREADS --resources ram=MAX_MEMORY -p
- Explore results. Once Dekupl-run has been successfully executed, the DE contigs it produced are located under DEkupl_results/A_vs_B_kmer_counts/merged-diff-counts.tsv.gz. They can be annotated using Dekupl-annotation and visualized with Dekupl-viewer.
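When the pipeline is launched from a script, the command line above can be assembled programmatically. The following is a minimal sketch, not part of DE-kupl: the helper name `build_dekupl_command` is hypothetical, and it only reproduces the flags shown in this README (`--configfile`, `-j`, `--resources ram=`, `-p`).

```python
import shlex


def build_dekupl_command(config_json, nb_threads, max_memory_mb):
    """Return the dekupl-run argument list shown in this README.

    Hypothetical helper: it only formats the documented flags and does
    not check that dekupl-run is installed.
    """
    if nb_threads < 1 or max_memory_mb < 1:
        raise ValueError("threads and memory must be positive integers")
    return [
        "dekupl-run",
        "--configfile", config_json,
        f"-j{nb_threads}",
        "--resources", f"ram={max_memory_mb}",  # memory in megabytes
        "-p",
    ]


cmd = build_dekupl_command("my-config.json", 8, 16000)
cmd_text = shlex.join(cmd)  # shell-quoted string, handy for logging
print(cmd_text)
```

The list form can be passed to `subprocess.run(cmd, check=True)` once dekupl-run is available on the PATH; Docker or Singularity installations would need their own prefix in front of these arguments.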
We recommend using Singularity to install dekupl-run, but you can also use Docker or a manual installation.
One can create a Singularity container from the Docker image. Two methods are available; both should work.
- Step 1: Build Singularity image
singularity build dekupl-run.simg docker://transipedia/dekupl-run:1.3.5
It's advised to mount some volumes (input/output directories). To mount the "/store" volume you should use "--bind /store:/store". That way, you can access the /store directory (in your configuration file, notably). Make sure your config.json is in the same folder as dekupl-run.simg.
- Step 2: Use dekupl-run with mounted volumes
singularity run --bind /store:/store ./dekupl-run.simg --configfile config.json -jNB_THREADS
- Step 1: Retrieve the docker image.
docker pull transipedia/dekupl-run:1.3.5
- Step 2: Run dekupl-run.
You may need to mount some volumes:
  - Your my-config.json to /dekupl/my-config.json
  - Your fastq_dir (the one defined in your config.json) to /dekupl/FASTQ_DIR
  - Your output_dir (the one defined in your config.json) to /dekupl/OUTPUT_DIR
  - Any other necessary folder depending on your config.json
docker run --rm -v ${PWD}/my-config.json:/dekupl/my-config.json \
  -v ${PWD}/data:/dekupl/data -v ${PWD}/results:/dekupl/results \
  transipedia/dekupl-run --configfile my-config.json \
  -jNB_THREADS --resources ram=MAX_MEMORY -p
- Step 1: Install dependencies. Before using Dekupl-run, install the following:
  - Snakemake, jellyfish, pigz, CMake, boost
  - R packages (DESeq2, RColorBrewer, pheatmap, foreach, doParallel)
Rscript install_r_packages.R
- Step 2: Clone this repository including submodules.
git clone --recursive https://github.com/Transipedia/dekupl-run.git
- Step 3: Edit config file & run dekupl-run with Snakemake.
snakemake -jNB_THREADS --resources ram=MAX_MEMORY -p
Here is an example of a minimal config file with only mandatory information. You can copy this base and adapt it to your needs (see following paragraphs).
The samples parameter, which contains the list of samples with their associated conditions, can be replaced with a TSV file using the samples_tsv option (see below).
Note: even though an arbitrary config file name can be specified on the command line (using --configfile), a non-empty file named 'config.json' must be present in the current directory. 'config.json' will be overridden by the file specified on the command line.
{
  "fastq_dir": "data",
  "dekupl_counter": {
    "min_recurrence": 2,
    "min_recurrence_abundance": 5
  },
  "diff_analysis": {
    "condition": {
      "A": "A",
      "B": "B"
    },
    "pvalue_threshold": 0.05,
    "log2fc_threshold": 2
  },
  "samples": [
    {"name": "sample1", "condition": "A"},
    {"name": "sample2", "condition": "A"},
    {"name": "sample3", "condition": "B"},
    {"name": "sample4", "condition": "B"}
  ]
}
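For larger cohorts it can be convenient to generate this file rather than write it by hand. The sketch below rebuilds the minimal config above from a list of (name, condition) pairs. The helper `make_config` is hypothetical, and it assumes that the "A" and "B" keys of `diff_analysis.condition` map to the condition labels used in `samples`; the numeric values are simply those of the example.

```python
import json


def make_config(fastq_dir, samples, cond_a="A", cond_b="B"):
    """Build a minimal DE-kupl config dict (hypothetical helper).

    samples is a list of (name, condition) pairs; every condition must
    be one of the two labels being compared.
    """
    unknown = {cond for _, cond in samples} - {cond_a, cond_b}
    if unknown:
        raise ValueError(f"unexpected conditions: {sorted(unknown)}")
    return {
        "fastq_dir": fastq_dir,
        "dekupl_counter": {
            "min_recurrence": 2,
            "min_recurrence_abundance": 5,
        },
        "diff_analysis": {
            # Assumption: keys A/B point to the labels used in "samples".
            "condition": {"A": cond_a, "B": cond_b},
            "pvalue_threshold": 0.05,
            "log2fc_threshold": 2,
        },
        "samples": [{"name": n, "condition": c} for n, c in samples],
    }


config = make_config(
    "data",
    [("sample1", "A"), ("sample2", "A"), ("sample3", "B"), ("sample4", "B")],
)
config_text = json.dumps(config, indent=2)
print(config_text)  # write this text to config.json to use it
```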
How can I use DEkupl-run with non-human data?
You need to specify your own FASTA using the transcript_fasta option, as well as a file mapping each transcript_id to its gene_id using the transcript_to_gene option.
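As an illustration of the mapping file, here is a minimal sketch that derives a two-column transcript/gene table from FASTA headers. It assumes a Gencode-like, pipe-separated header in which the first field is the transcript ID and the second the gene ID; transcriptomes with a different header layout need a different parser. The function name and the toy identifiers are hypothetical.

```python
import io


def transcript_to_gene_rows(fasta_lines):
    """Extract (transcript_id, gene_id) pairs from FASTA header lines.

    Assumption: headers look like '>transcript_id|gene_id|...'.
    """
    rows = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, nothing to extract
        fields = line[1:].strip().split("|")
        if len(fields) < 2 or not fields[0] or not fields[1]:
            raise ValueError(f"no gene ID found in header: {line.strip()}")
        rows.append((fields[0], fields[1]))
    return rows


# Toy FASTA content standing in for a real transcriptome file.
toy_fasta = io.StringIO(">tx1|geneA|extra\nACGT\n>tx2|geneA\nGGCC\n>tx3|geneB\nTTAA\n")
mapping = transcript_to_gene_rows(toy_fasta)
for transcript_id, gene_id in mapping:
    print(f"{transcript_id}\t{gene_id}")
```

For a real file, the same function can be fed an open file handle, and the printed lines redirected to the path given in transcript_to_gene.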
How can I use DEkupl-run with single-end reads?
Set the lib_type parameter to "single". You can also specify the fragment length (see section Configuration for single-end libraries).
- fastq_dir: Location of FASTQ files
- kmer_length: Length of k-mers (default: 31). This value should not exceed 32.
- diff_method: Method used for k-mer differential testing (default: DESeq2). Possible choices are 'Ttest' which is the fastest, 'DESeq2' which is more sensitive. 'limma' can be a fast and sensitive alternative, especially for large cohorts. Note: since the speedup of DESeq2 in version 1.26.0, we advise to use DESeq2 in any circumstance.
- gene_diff_method: Method used for gene differential testing (default: 'DESeq2' or 'limma-voom' if number of samples > 100). Possible choices are 'DESeq2' and 'limma-voom'. 'limma-voom' is a faster alternative for large cohorts.
- lib_type: Paired-end library type (default: rf). Specify either rf for reverse-forward strand-specific libraries, fr for strand-specific forward-reverse, or unstranded for unstranded libraries.
- output_dir: Location of DE-kupl results (default: DEkupl_result).
- tmp_dir: Temporary directory to use (default: ./ aka current directory)
- r1_suffix: Suffix to use for the FASTQ with left mate. Set r2_suffix for the second FASTQ.
- dekupl_counter:
  - min_recurrence: Minimum number of samples to support a k-mer (default: 10% of the size of the input condition with the fewest replicates).
  - min_recurrence_abundance: Minimum abundance threshold to consider a k-mer in the recurrence filter (default: 5).
- diff_analysis:
  - condition: Specify A and B conditions.
  - pvalue_threshold: Maximum adjusted p-value to consider a k-mer as DE. Only DE k-mers are selected for assembly.
  - log2fc_threshold: Minimum log2 fold change to consider a k-mer as DE.
- samples: An array of samples. Each sample is described by a name and a condition. The FASTQ files for a sample will be located using the following pattern: fastq_dir/sample_name_{1,2}.fastq.gz. You can also provide a TSV file with your samples and conditions with the samples_tsv parameter (see below).
- samples_tsv: A sample sheet in TSV format with at least a column 'name' with sample names and a column 'condition' with their associated conditions. This file must have a header line with the column names.
- ref_masking: A FASTA sequence file to be used for masking. All k-mers from these sequences will be removed from further analysis. By default DEKupl-run uses the human Gencode 24 transcriptome for masking. To change this, add to the config.json file:
"ref_masking": "transcriptome_masking.fa"
- ref_kallisto: The reference transcriptome to be used for gene expression analysis by Kallisto. By default DEKupl-run uses the human Gencode 24 transcriptome. To change this, add to the config.json file:
"ref_kallisto": "transcriptome_kallisto.fa"
- transcript_to_gene: A transcript-to-gene conversion table. A two-column tab-separated file, with the transcript ID in the first column and the gene ID in the second column. This file is not mandatory if the FASTA transcriptome given to Kallisto is from Gencode, where the gene ID can be extracted from the sequence names. An example of this file can be found here: tests/gencode.v24.transcripts.head1000.mapping.tsv.
- seed: Whether to fix the seed for k-mer differential statistics (default: 'fixed'; possible choices are 'fixed' or 'not-fixed'). By default DEKupl-run fixes the seed to remove the variation due to the statistical method, but this can add considerable overhead to the analysis. Not useful for Ttest.
- masking: State of the masking step (default: mask). Setting it to nomask will skip the masking step.
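The samples_tsv sheet described in the list above can be produced with a few lines of code. This is a minimal sketch assuming only what is stated there: a header line and the two columns 'name' and 'condition'. The helper name and the sample values are illustrative.

```python
import csv
import io


def write_samples_tsv(samples, handle):
    """Write (name, condition) pairs as a TSV sheet with a header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", "condition"])  # header required by samples_tsv
    for name, condition in samples:
        writer.writerow([name, condition])


# An in-memory buffer keeps the example self-contained; use
# open("samples.tsv", "w", newline="") to write a real file.
buffer = io.StringIO()
write_samples_tsv(
    [("sample1", "A"), ("sample2", "A"), ("sample3", "B"), ("sample4", "B")],
    buffer,
)
samples_tsv_text = buffer.getvalue()
print(samples_tsv_text, end="")
```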
For single-end libraries, please specify the following parameters:
- lib_type: Set lib_type to single for a single-end strand-specific library, or to unstranded for single-end unstranded libraries.
- fragment_length: The estimated fragment length (necessary for kallisto quantification). Default value is 200.
- fragment_standard_deviation: The estimated standard deviation of fragment length (necessary for kallisto quantification). Default value is 30.
Notes :
The fastq files for single-end samples will be located using the following path : {fastq_dir}/{sample_name}.fastq.gz
If present, parameters r1_suffix and r2_suffix will be ignored.
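Before launching a run, it can help to list the FASTQ paths that are expected for each sample. The sketch below follows the two patterns given in this README, {fastq_dir}/{sample_name}.fastq.gz for single-end data and fastq_dir/sample_name_{1,2}.fastq.gz for paired-end data. The helper and its `single_end` flag are hypothetical, and the default suffixes `_1.fastq.gz` and `_2.fastq.gz` are inferred from the paired-end pattern rather than taken from the DE-kupl code.

```python
import posixpath


def expected_fastq_paths(fastq_dir, sample_name, single_end=False,
                         r1_suffix="_1.fastq.gz", r2_suffix="_2.fastq.gz"):
    """Return the FASTQ paths implied by the documented naming patterns.

    Assumption: the default suffixes mirror the sample_name_{1,2}.fastq.gz
    pattern. For single-end samples the suffixes are ignored.
    """
    if single_end:
        return [posixpath.join(fastq_dir, sample_name + ".fastq.gz")]
    return [
        posixpath.join(fastq_dir, sample_name + r1_suffix),
        posixpath.join(fastq_dir, sample_name + r2_suffix),
    ]


paired = expected_fastq_paths("data", "sample1")
single = expected_fastq_paths("data", "sample1", single_end=True)
print(paired)
print(single)
```

Checking each returned path with `os.path.exists` gives an early warning about misnamed files.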
The output directory of a DE-kupl run will have the following content :
├── {A}_vs_{B}_kmer_counts
│ ├── diff-counts.tsv.gz
│ ├── merged-diff-counts.tsv.gz
├── gene_expression
│ ├── {A}vs{B}-DEGs.tsv
├── kmer_counts
│ ├── normalization_factors.tsv
│ ├── raw-counts.tsv.gz
│ ├── noGENCODE-counts.tsv.gz
│ ├── {sample}.jf
│ ├── {sample}.txt.gz
│ ├── ...
├── metadata
│ ├── sample_conditions.tsv
│ ├── sample_conditions_full.tsv
The following table describes the output files produced by DE-kupl :
FileName | Description |
---|---|
diff-counts.tsv.gz | Contains k-mer counts from noGENCODE-counts.tsv.gz that have passed the differential testing. Output format is a TSV with the following columns: kmer pvalue meanA meanB log2FC [SAMPLES]. |
merged-diff-counts.tsv.gz | Contains assembled k-mers from diff-counts.tsv.gz. Output format is a TSV with the following columns: nb_merged_kmers contig kmer pvalue meanA meanB log2FC [SAMPLES]. |
raw-counts.tsv.gz | Contains raw k-mer counts of all libraries that have been filtered with the recurrence filters. |
noGENCODE-counts.tsv.gz | Contains k-mer counts from raw-counts.tsv.gz after removing k-mers found in the reference transcripts (e.g. GENCODE by default). |
sample_conditions_full.tsv | Tabulated file with sample names, conditions and normalization factors. sample_conditions.tsv is the sample |
Note: when limma-voom is used as the k-mer statistical method, meanA and meanB are in CPM (counts per million).
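To give an idea of how these outputs can be explored, the sketch below reads a merged-diff-counts.tsv.gz file and lists contigs by increasing p-value. It assumes the file starts with a header line carrying the column names listed in the table above (contig, pvalue, log2FC, ...) and that these columns hold plain numbers; the function name, the fold-change cutoff and the demo rows are synthetic and only serve to make the example runnable.

```python
import csv
import gzip
import os
import tempfile


def top_contigs(path, min_abs_log2fc=2.0):
    """Return (contig, pvalue, log2FC) tuples sorted by p-value.

    Assumption: the gzipped TSV has a header with 'contig', 'pvalue'
    and 'log2FC' columns, as described in the output table.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        kept = [row for row in reader
                if abs(float(row["log2FC"])) >= min_abs_log2fc]
    kept.sort(key=lambda row: float(row["pvalue"]))
    return [(r["contig"], float(r["pvalue"]), float(r["log2FC"])) for r in kept]


# Synthetic demo file with two sample columns (toy values, not real data).
header = ["nb_merged_kmers", "contig", "kmer", "pvalue",
          "meanA", "meanB", "log2FC", "sample1", "sample2"]
demo_rows = [
    [3, "ACGTACGTAA", "ACGTACGT", 0.001, 1.0, 9.0, 3.1, 1, 9],
    [1, "TTGACCAA", "TTGACCAA", 0.04, 5.0, 6.0, 0.3, 5, 6],
    [2, "GGCATTGCA", "GGCATTGC", 0.0002, 40.0, 4.0, -3.3, 40, 4],
]
demo_path = os.path.join(tempfile.mkdtemp(), "merged-diff-counts.tsv.gz")
with gzip.open(demo_path, "wt", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(header)
    writer.writerows(demo_rows)

result = top_contigs(demo_path)
for contig, pvalue, log2fc in result:
    print(contig, pvalue, log2fc)
```

Real result files can be large, so for production use a streaming filter or a dataframe library may be a better fit than loading all kept rows into a list.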
It is now possible to run DE-kupl-style analysis on whole-genome data, i.e. without using a reference transcriptome.
To do so, please change data_type to WGS in config.json.
- If new samples are added to the config.json, make sure to remove the metadata folder in order to force Snakemake to re-make all targets that depend on this file.
- Snakemake uses Rscript, not R. If an R module is not installed, type which Rscript and which R and make sure they point to the same installation of R.
- For OSX support you need to install the coreutils package with Homebrew: brew install coreutils. This package provides GNU versions of common Unix commands like "sort", "join", etc.