-
Notifications
You must be signed in to change notification settings - Fork 103
IsoPhase: Haplotyping using Iso Seq data
Last Updated: 11/12/2020
- Prerequisite
- Install IsoPhase
- What you will need for phasing
- Running IsoPhase
- Summarizing IsoPhase output
The following tools are required to run IsoPhase.
Follow the install steps for installing Cupcake. I will be using the Anaconda environment for the remainder of this tutorial.
# activate anaconda environment, here I assume it's called anaCogent
$ source activate anaCogent
(anaCogent)-bash-4.1$ git clone https://github.com/Magdoll/cDNA_Cupcake.git
(anaCogent)-bash-4.1$ cd cDNA_Cupcake
(anaCogent)-bash-4.1$ python setup.py build
(anaCogent)-bash-4.1$ python setup.py install
- Reference genome, in fasta format
- Full-length (CCS/HiFi) reads, in fastq format (best if filtered for accuracy!, see below)
- Annotation GFF file using Iso-Seq data, where each isoform has ID format
PB.X.Y
- Association file linking individual full-length reads to isoforms
PB.X.Y
This is the reference genome that is associated with the gene/isoform annotations. It must be in fasta format.
This is the full-length (CCS/HiFi) reads. It must be in fastq format.
When you run Iso-Seq 3 through either Bioconda or SMRT Link, it will generate the full-length, non-concatemer read file, usually with the filename unpolished.flnc.bam
. This BAM file can be converted to FASTQ using:
bamtools convert -format fastq -in unpolished.flnc.bam > flnc.fastq
The GFF files should have a field transcript_id
where the ID is in PB.X.Y
format where X
is the gene locus and Y
is the isoform index.
For example:
chr1 PacBio transcript 226419 242943 . - . gene_id "PB.1"; transcript_id "PB.1.1";
chr1 PacBio exon 226419 229394 . - . gene_id "PB.1"; transcript_id "PB.1.1";
chr1 PacBio exon 232706 232874 . - . gene_id "PB.1"; transcript_id "PB.1.1";
chr1 PacBio exon 235860 236071 . - . gene_id "PB.1"; transcript_id "PB.1.1";
chr1 PacBio exon 242773 242943 . - . gene_id "PB.1"; transcript_id "PB.1.1";
chr1 PacBio transcript 229163 232874 . - . gene_id "PB.1"; transcript_id "PB.1.2";
chr1 PacBio exon 229163 229394 . - . gene_id "PB.1"; transcript_id "PB.1.2";
chr1 PacBio exon 232706 232874 . - . gene_id "PB.1"; transcript_id "PB.1.2";
Here, PB.1.1
and PB.1.2
are two isoforms of the same gene (PB.1
). It is important to know which isoforms belong to the same genes since IsoPhase combines all FL reads from the same gene for SNP calling and phasing.
The association file must have at least two fields, id
and pbid
where id
is the full-length (CCS/HiFi) read ID and pbid
corresponds to the unique isoforms from the annotation GFF file.
For example:
id length is_fl stat pbid
m54033_171101_101105/52102116/ccs 4369 Y unique PB.1.1
m54033_171101_101105/15073981/ccs 4468 Y unique PB.1.1
m54033_171101_035054/42533716/ccs 4392 Y unique PB.1.1
m54033_171101_101105/21889610/ccs 4374 Y unique PB.1.2
m54033_171101_101105/6422907/ccs 4366 Y unique PB.1.2
m54033_171101_035054/15860654/ccs 4426 Y unique PB.1.2
NOTE: IsoPhase currently only calls substitution SNPs. It also will only call SNPs at positions with > 40-fold coverage.
First we will generate a by_loci
directory where each by_loci/PB.X_sizeN
contains information about the locus PB.X
and N
is the number of full-length reads associated with transcript isoforms PB.X.Y
.
usage: select_loci_to_phase.py [-h] [-c COVERAGE]
genome_fasta flnc_filename gff_filename
stat_filename
positional arguments:
genome_fasta Reference genome fasta
flnc_filename FLNC fastq file
gff_filename GFF file of transcripts, IDs must be PB.X.Y
stat_filename Tab-delimited read stat file linking FLNC to PB.X.Y
-c COVERAGE, --coverage COVERAGE
Minimum FLNC coverage required (default: 40)
For example, we will run
python <path_to>/cDNA_Cupcake/phasing/utils/select_loci_to_phase.py \
genome.fasta isoseq_flnc.fastq \
hq.fastq.no5merge.collapsed.filtered.gff \
hq.fastq.no5merge.collapsed.read_stat.renamed.txt \
-c 40
We first modify the template bash script run_phasing_in_dir.sh
by making a local copy and editing it.
cp <path_to>/cDNA_Cupcake/phasing/utils/run_phasing_in_dir.sh .
Change PLOIDY
to suit your data. PLODIY=2
means diploid and PLOIDY=4
means tetraploid, for example.
MIN_BQ
is the minimum QV used by IsoPhase for SNP calling.
PLOIDY=2
MIN_BQ=13
Optionally, you can uncomment and modify the following section to run GMAP (or minimap2) against the refernce genome. This will align the FL reads for each locus to the reference genome and generate an aligned, sorted BAM file.
#fq2fa.py ccs.fastq # this can be removed in later
#gmap -D ~/share/gmap_db_new/ -d maize4 -f samse -n 0 -t 4 ccs.fasta > ccs.maize4.sam
#samtools view -bS ccs.maize4.sam > ccs.maize4.bam
#samtools sort ccs.maize4.bam > ccs.maize4.sorted.bam
#samtools index ccs.maize4.sorted.bam
Next, we will run the command to generate a per-locus bash script. After the command below, there should be a run.sh
in each by_loci/PB.X_sizeN
subdirectory.
ls -1d by_loci/*size*|xargs -n1 -i echo "bash run_phasing_in_dir.sh {}" > cmd
bash cmd
We can now run each of the by_loci/PB.X_sizeN/run.sh
. Below is an example command to go through each subdirectory and generate the command used for it.
ls -1d by_loci/*size*|xargs -n1 -i echo "cd {}; bash run.sh; cd ../../" > cmd2
bash cmd2
You can split up cmd2
into chunks to run each of them in parallel.
The two VCF output files in each subdirectory are:
phased.nopartial.cleaned.vcf # phasing using reads that cover every SNP base
phased.partial.cleaned.vcf # phasing using reads that cover at least one SNP base
With phased.nopartial.cleaned.vcf
, the number of FL reads that cover every SNP base may be significantly reduced (since alternative splicing might mean some reads skip certain exons that contain SNPs) and the result is there may be fewer SNPs called and phasing may not be successful.
With phased.partial.cleaned.vcf
, reads that do not have every SNP base called have to be inferenced. When there are very few SNPs, this may result in some reads being assigned to the wrong haplotypes. Users are advised to look at both VCF output to determine the quality of the results.
You can summarize the phasing results using:
python ~/GitHub/cDNA_Cupcake/phasing/utils/summarize_byloci_results.py
This generates a summary file called summarized.isophase_results.txt
which shows the number of SNPs called (num_snp
), the number of alleles inferenced with only reads that cover every SNP (num_hap_nopartial
) and all reads (num_hap_withpartial
).
locus size num_snp num_hap_nopartial num_hap_withpartial
by_loci/PB.12426_size82 82 5 2 2
by_loci/PB.21892_size59 59 0 0 0
by_loci/PB.21890_size531 531 2 2 2
You can also collect all the VCFs from each separate directory into a single file using:
python ~/GitHub/cDNA_Cupcake/phasing/utils/collect_all_vcf.py
which generates a summarized SNP VCF file IsoSeq_IsoPhase.vcf
. NOTE however, that this file only contains SNP information - but not isoform information, since that is specific to each gene locus.
You can further decide which directories (loci) to tally and which VCF output file to use:
usage: collect_all_vcf.py [-h] [-d DIR] [-o OUTPUT]
[--vcf {phased.partial.vcf,phased.partial.cleaned.vcf,phased.nopartial.vcf,phased.nopartial.cleaned.vcf}]
[-s SELECT_DIRS]
optional arguments:
-h, --help show this help message and exit
-d DIR, --dir DIR Directory containing the subdirs of IsoPhase (default:
by_loci/)
-o OUTPUT, --output OUTPUT
Output VCF filename (default: IsoSeq_IsoPhase.vcf)
--vcf {phased.partial.vcf,phased.partial.cleaned.vcf,phased.nopartial.vcf,phased.nopartial.cleaned.vcf}
VCF to use per directory (default:
phased.partial.cleaned.vcf)
-s SELECT_DIRS, --select_dirs SELECT_DIRS
Comma separate list of directories to tally - if this
is used, <dir> is ignored