Irisas (Integrated region-based variant synchronization and Annotation for association studies)
This pipeline includes two modules:
*synchronize INDEL and other variants for GWAS and population genomics analysis.
*generate variants tables of multiple variants integrating effect on ORF level for GWAS aim.
Gaps between the variance explained by identified SNP QTLs and estimated heritability suggested there are missing causal variants. It has been attracting a lot of attentions about how to uncover missing causal variants. We developed this pipeline to uncover causal INDEL and integrating effects of multiple variants.
INDEL could cause phenotype diversity. The causal role of INDEL might could not be completely uncovered by SNPs, due to non-perfect linkage disequilibrium (LD). So performing association analysis, taking INDEL as independent variable, might could uncover novel causal variants.
For NGS data analysis, when aligning a divergent sequence to a reference genome, alignment isomorphs usually occur, where the essentially same sequence could be aligned in different ways. It is very important to synchronize those INDELs to take them as independent variables for association analysis.
The protein coding region of genome sequence would be translated into functional protein molecular through several key steps. Any mutation that alter the transcription and translation are generally result into no final protein product or loss of function protein product. This kind of mutation is suspected to cause phenotype diversity. By integrating the effects of all variants in the gene region, we could predicat weather a gene could code functional protein. Taking this predication as independent variables, some causal genes could be uncovered.
You could go to our wiki page for more details about the underlying algorithm.
imrdenom
wigToBigWig
mafft
Exonerate
samtools
PLINK
Make sure you have JDK>=1.8 and ant installed on your computer.
git clone https://github.com/baoxingsong/Irisas.git
cd Irisas
ant
You will get Irisas.jar
under folder dist
There are two main modules in this packages. INDEL and other variant synchronization and integrating effects.
java -jar Irisas.jar
Commands:
-- One command for all the steps
EasyRun run all the steps with single command
-- Functions for whole genome wide variant synchronization
CutTheWholeGenomeWithWindow cut the whole genome sequence with a window
NewSdiFromMsa generate sdi files from MSA results
SdiToSnpPlink generate PLINK files of SNP
SdiToIndelPlink generate PLINK files of INDEL
-- Functions for integrating effect
ExtractCdsSequenceAndCheckFunc extract CDS&protein sequence
ExtractGenomeSequce extract the genome sequence of each transcript
GenerateLofPed generate PLINK files of integrating effect
-- Several advanced testing functions (under testing)
SdiSnpToPedMultipleAllic generate tped files of SNP with multiallelic enable
SdiIndelToPedMultipleAllic generate tped files of INDEL with multiallelic enable
SdiIndelToPedWithDecomposition generate tped files of SNP/INDEL with multiallelic enable
IndelSnpPlinkFromMsa generate tped files of SNP/INDEL, taking MSA result as input. with multiallelic enable
IndelPlinkFromMsa generate tped files of INDEL, taking MSA result as input. with multiallelic enable
java -Xmx50g -jar Irisas.jar EasyRun
-t [integer] thread number. (Default 5)
-w [integer] window size. (Default 10000)
-v [integer] overlap size between two neighbour windows (Default 500)
-r name of reference accession/line
-l list of accession names
-g the folder where the genome sequences and sdi files are located
-n [integer] number of accession to process for each batch, should be larger than thread number (Default 50)
-a reference gene structure annotation in GFF/GTF format
-i [integer] extend interval (Default 500)
*The overlap size between neighbour windows is twice the parameter -v
Current this pipeline takes reference genome sequence in fasta format, reads mapping in bam format and variant calling result in sdi format as input. If your variant calling results are in vcf format, you could easily reformat them into sdi format. Assume you have a vcf download from 1001 Arabidopsis thaliana website (http://1001genomes.org/data/GMI-MPI/releases/v3.1/intersection_snp_short_indel_vcf/intersection_10001.vcf.gz). After gzip, you could run this command to get sdi file.
cat intersection_10001.vcf | grep -v "#" | awk '{print "Chr"$1"\t"$2"\t"length($5)-length($4)"\t"$4"\t"$5}' > 10001.sdi
Get the pseudogenome sequence for each accession/line with imrdenom.
mcmerge getgenome -o <imr.fa> <ref> <imr.sdi>
Reformat bam files into bigwig files with the following command:
samtools mpileup 10001.bam | perl -ne 'BEGIN{print "track type=wiggle_0 name=10001 description=10001\n"};($c, $start, undef, $depth) = split; if ($c ne $lastC) { print "variableStep chrom=$c\n"; };$lastC=$c; print "$start\t$depth\n";' > 10001.wig
wigToBigWig 10001.wig chrSize 10001.bw
rm 10001.wig
Put all the sdi files, pseudogenome sequence, bigwig files and reference genome sequence under the same folder. The bigwig files and pseudogenome sequence files should be with the same prefix name with sdi file. The genome sequence should take .fa as postfix. And bigwig file should take bw as postfix.
Prepare a text file with each accession name as one line. And prepare a file with each chromosome name as one line.
There are several steps to perform variant synchronization. Firstly, you should cut the whole-genome sequence into smaller fragments, perform multiple sequence alignment with MAFFT or any multiple sequence alignment software you like. And assembly those MSA results together and perform variant calling.
java -jar Irisas.jar CutTheWholeGenomeWithWindow
-t [integer] thread number. (Default 5)
-w [integer] window size. (Default 10000)
-v [integer] overlap size between two neighbour window (Default 500)
-o output folder
-r name of reference accession/line
-l list of accession names
-g the folder where the genome sequences and sdi files are located
java -jar Irisas.jar NewSdiFromMsa
-t [integer] thread number. (Default 5)
-n [integer] number of accession to process for each batch, should be larger than thread number (Default 50)
-i input folder where could find MSA result
-l list of accession names
-r name of reference accession/line
-c list of chromosome names
-o output folder
-g the folder where the genome sequences and sdi files are located
*For RAM (random-access memory) saving purpose, this software does not read all the multiple sequence alignment result into RAM.
-n gives the number of accessions to read into RAM each time. And -n should be equal or larger than -t.
The above part is the pipeline to generate synchronized variants table for each individual accession/line. To feed sdi files into GWAS software, functions were developed to transform them into PLINK format.
java -jar Irisas.jar SdiToSnpPlink
-t [integer] thread number (Default 5)
-l list of accession names
-r name of reference accession/line
-g the folder where the genome sequences and bw files are located
-s the folder where sdi files are located
*Triple alleles were filtered. INDELs overlap with the encoding SNP, alleles with reads mapping coverage less than 2 and any potential heterozygous alleles were encoded as the missing value.
java -jar Irisas.jar SdiToIndelPlink
-t [integer] thread number. (Default 5)
-l list of accession names
-s the folder where sdi files are located
*For any INDEL positionally overlap with the currently encoding INDEL would be encoded as the missing value.
java -jar Irisas.jar ExtractCdsSequenceAndCheckFunc
-r reference genome sequence in fasta format
-a reference gene structure annotation in GFF/GTF format
-o output CDS sequence for each transcript
-p output protein sequence for each transcript
java -jar Irisas.jar ExtractGenomeSequce
-m genome sequence of target accession/line in fasta format
-s sdi file of target accession/line
-a reference gene structure annotation in GFF/GTF format
-o output folder for the current accession/line. (it is called exonerate working folder)
-i [integer] extend interval (Default 1000)
java -jar Irisas.jar GenerateLofPed
-a reference gene structure annotation in GFF/GTF format
-f exonerate working folder for this accession/line
-s sdi file of target accession/line
-m genome sequence of target accession/line in fasta format
-r reference genome sequence in fasta format
*ORF-state disturbed allele be encoded as 2 and ORF-state conserved allele encoded as 1.
Irisas is designed and tested for inbreeding lines.
Song B, Mott R, Gan X (2018) Recovery of novel association loci in Arabidopsis thaliana and Drosophila melanogaster through leveraging INDELs association and integrated burden test. PLoS Genet 14(10): e1007699.