sQTLseekeR is a R package to detect splicing QTLs (sQTLs), which are variants associated with change in the splicing pattern of a gene. Here, splicing patterns are modeled by the relative expression of the transcripts of a gene.
For more information about the method and performance see article : Monlong, J. et al. Identification of genetic variants associated with alternative splicing using sQTLseekeR. Nat. Commun. 5:4698 doi: 10.1038/ncomms5698 (2014).
To install the latest development version: devtools::install_github("jmonlong/sQTLseekeR")
.
This requires devtools
package (more information here)
which can be installed with install.packages("devtools")
.
Some Bioconductor packages are also required. They can be installed with :
source("http://bioconductor.org/biocLite.R")
biocLite(c("Rsamtools", "qvalue"))
R 3.1 or higher is required.
The first step is to prepare the input data. sQTLseekeR
requires three inputs:
- transcript expression. Column trId and geneId, corresponding to the transcript and gene ID are required. Then each column represents a sample and is filled with the expression values. Relative expression will be used hence both read counts or RPKMs works as the expression measure. However, it is not recommended to use transformed (log, quantile, or any non-linear transformation) counts or RPKMs because Hellinger distance is suited for relative expression.
- gene location information. In a BED-like format, the location of each gene is explicitly defined in this file.
- genotype information. The genotype of each sample is coded as follow: 0 for ref/ref; 1 for ref/mutated; 2 for mutated/mutated; -1 for missing value. Furthermore the first four columns should gather information about the SNP: chr, start, end and snpId. Finally this file needs to be ordered per chr and start position.
When all input files are correctly formatted sQTLseekeR
prepares the data through functions prepare.trans.exp
and index.genotype
.
prepare.trans.exp
will :- remove transcripts with low expression.
- remove genes with less than two expressed transcript.
- remove genes with low splicing dispersion.
- remove genes with not enough different splicing patterns.
- flag samples with low gene expression.
index.genotype
compresses and indexes the genotype file to optimize further accession of particular regions. Note that the input file should not be compressed.
Once the input files are ready, sqtl.seeker
function will compute the P-values for each pair of gene/SNP testing the association between the genotype and transcript relative expression. Here is a quick description of the parameters that would most likely be tweaked:
genic.window
the window(bp) around the gene in which the SNPs are tested. Default is 5000 (i.e. 5kb).svQTL
should svQTLs test be performed in addition to sQTLs (default is FALSE). svQTLs are used to identify potential false positive among the significant sQTLs. svQTLs represents situation where the variance in transcript relative expression is different between genotype groups. In this particular situation identification of sQTLs is less robust as we assume homogeneity of the variance between groups, hence it might be safer to remove svQTLs from the list of reported sQTLs. However computation of svQTLs cannot rely on an asymptotic approximation, hence the heavy permutations will considerably increase the running time.nb.perm.max
the maximum number of permutation/simulation to compute the P-value. The higher this number, the lower the P-values can potentially get but the longer the computation.
Finally, function sqtls
is used to retrieve significant associations. The user can manually define a false discovery rate(FDR) or perform further filtering afterwards. Of note, there is a separate FDR threshold for svQTL removal (if svQTLs were computed), which is usually preferred to stay low (e.g. around 0.01).
An example of an analysis can be found in folder scripts
.
sQTLseekeR
can be used on a cluster using package BatchJobs
. An example of an analysis using BatchJobs
can be found in folder scripts
.
BatchJobs
is a potent package but basic functions are enough in our situation. Here is a quick practical summary of BatchJobs
commands used in the script:
makeRegistry
creates a registry used to manipulate jobs for a particular analysis step.batchMap
adds jobs to a registry. Simply, the user gives a function and a list of parameters. One job per parameter will be created to compute the output of the function using this specific parameter.submitJobs
submits the jobs to the cluster. This is where the queue, maximum computation time, number of core can be specified. Moreover, if needed, a subset of the jobs can be sent to the cluster. FunctionsfindNotDone
andfindErrors
are particularly useful to find which jobs didn't finish or were lost in the limbo of the cluster management process.showStatus
outputs the status of the computations.loadResult
retrieves the output of one specific job, whilereduceResultsList
retrieves output for all jobs into a list format.
Another important point about BatchJobs
is its configuration for the computing cluster. An example of the configuration files can be found in the scripts
folder:
- If present in the working directory,
.BatchJobs.R
is loaded when theBatchJobs
package is loaded. It defines which template to use andBatchJobs
functions. In practice, it loads another R script file (heremakeClusterFunctionsAdaptive.R
) with the functions to use. In.BatchJobs.R
users would only need to change the email address where to send the log messages to. - In
makeClusterFunctionsAdaptive.R
, users just need to check/replaceqsub
/qdel
/qstat
calls with the correct bash commands (sometimesmsub
/canceljob
/showq
). This file should also be in the working directory whenBatchJobs
is loaded. - Finally
cluster.tmpl
is a template form of a job bash script that would be send to the cluster. There the correct syntax for the resources or parameters of the cluster are defined. This file should also be in the working directory whenBatchJobs
is loaded.
An output of NULL
means there were no gene/SNPs to analyze. It is likely due to inconsistent input files. To debug check that :
- Gene IDs in the transcript expression and gene location are similar.
- Genomic coordinates in the gene location and genotype information are consistent. E.g. 1 vs chr1.
- Sample names in the transcript expression and genotype are similar.
If the input files are fine, an output of NULL
might be caused by inappropriate transcript expression (e.g. genes with low expression) or genotypes (e.g. many missing values).
svQTLs are SNPs associated with splicing variability of a gene. Here the relative transcript expression wight be globally similar between genotype groups but much more variable in specific one. Although the biological interpretation is not straightforward, we use them to flag potential false sQTLs. Indeed, the test for differential transcript relative expression assumes similar variability between genotype groups. Hence if a conservative approach to find sQTLs would be to retrieve significant sQTLs that are not svQTLs.
By default, sQTLseekeR tests association between a gene and SNPs within or close by (defined by genic.window
parameter). Testing association between all SNPs and all genes is not feasible : it would require too much computation and we don't see a good multiple-testing correction for this design that would fit our permuted/simulated P-values.
However, the user can specify exactly which regions should be tested for each gene. Instead of using the gene location for gene.loc
parameter, the user can feed the locations of the regions to test. Similarly to the gene location, chr, start, end and geneId columns define the region and the gene to test. Several regions can be defined for a same gene. Of note, in this design, genic.window=0
could be used to ensure that no flanking regions are added to the regions to test.