-
Notifications
You must be signed in to change notification settings - Fork 103
Tutorial: Iso Seq Bioinfx hands on
Last Updated: 01/19/2022
Tutorial data can be downloaded here
Official IsoSeq(3) documentation.
conda activate isoseqtoy
cd ~/workshop_data/isoseq3/
cat run.sh
Cheat Sheet:
lima --isoseq --dump-clips --peek-guess -j 8 alz.ccs.bam isoseq_primers.fasta alz.demult.bam
isoseq3 refine --require-polya alz.demult.5p--3p.bam isoseq_primers.fasta alz.flnc.bam -j 8
isoseq3 cluster alz.flnc.bam alz.polished.bam --verbose --use-qvs -j 8
pbmm2 align ../hg38.mmi alz.polished.hq.bam alz.aligned.bam --preset ISOSEQ --sort -j 8
isoseq3 collapse alz.aligned.bam alz.collapsed.gff
If you need to remove a file or symbolic link, use rm <filename>
. But be careful that if you remove a symbolic link - the actual file is still there. If you remove a physical file, the file is gone.
Use cat run.sh
to look at the commands we are gonna run.
NOTE: there is one typo in run.sh
that will cause a command to fail. Can you spot which one?
Practice 1.1. Check the isoseq3 version in your environment
isoseq3 --version
Practice 1.2. Run through isoseq3 steps, starting from lima
to isoseq3 cluster
output.
Practice 1.3. Learn to use pbmm2
and isoseq3 collapse
to align isoseq3 cluster output (HQ isoforms) to the reference genome, collapse to produce the GFF output. Find the section in SMRT Tools documentation that describes how to do this.
Practice 1.4. Learn to visualize the GFF output using IGV and UCSC Genome Browser.
- Can you change the track name?
- Can you change the track colors?
- Can you navigate to a specific gene (ex: USP11)
NOTE: you may need to delete the comments (the beginning lines starting with #
in the GFF file) for IGV to show the exon chain properly. It'll work just fine in UCSC without modification.
Remember to deactivate the previous conda env and load Cupcake module
conda activate SQANTI3.env
cd ~/workshop_data/isoseq3/
Practice 2.1. Practice reverse complementing sequences using revcomp.py
revcomp.py ATGGGG
revcomp.py CGAGAATT
Practice 2.2. Practice getting sequence length statistics using get_seq_stats.py. What is different about the three commands below? What are they trying to do?
get_seq_stats.py --help
get_seq_stats.py alz.collapsed.fasta
get_seq_stats.py -b 100 alz.collapsed.fasta
Practice 2.3. Practice using get_seqs_from_list by making a sequence ID list you want to extract
We will practice by extracting the first 5 sequences from the isoseq3 output HQ fasta file.
gunzip alz.polished.hq.fasta.gz
get_seqs_from_list.py --help
grep ">" alz.polished.hq.fasta | head -n 5 | sed 's/ .*//' | sed 's/>//' > ids.txt
get_seqs_from_list.py alz.polished.hq.fasta ids.txt | more
Try running the complicated grep
line one segment at a time - what do you get when you do the commands below?
grep ">" alz.polished.hq.fasta
grep ">" alz.polished.hq.fasta | head -n 5
grep ">" alz.polished.hq.fasta | head -n 5 | sed 's/ .*//'
grep ">" alz.polished.hq.fasta | head -n 5 | sed 's/ .*//' | sed 's/>//'
cat ids.txt
Practice 2.4. Practice converting between fasta <-> fastq using fa2fq.py and fq2fa.py
cp isoseq_primers.fasta test.fasta
fa2fq.py test.fasta
cat test.fastq
fq2fa.py test.fastq
cat test.fasta
(BONUS) Practice 2.5. Alternative way to get collapsed GFFs - using collapse_isoforms_by_sam.py. Use the same dataset from Practice 1.3 to practice this.
NOTE: the output GFF can differ slightly due to implementation differences between Cupcake collapse and isoseq3 collapse.
You can check that you have Cupcake installed and the $PATH
variable set correctly by:
$ which collapse_isoforms_by_sam.py
<home_dir>/mylib/bin/collapse_isoforms_by_sam.py
conda activate SQANTI3.env
cd ~/workshop_data/Cogent
Practice 3.1. Check Cogent version by typing reconstruct_contig.py --version
Practice 3.2. Run gene family finding. How many input sequences are there? How many partitions (gene families) is produced?
You can just run the two commands in run.sh
for this.
Practice 3.3. Run coding genome reconstruction. How many reconstructed contigs are there for each partition?
*NOTE: run.sh does not contain the commands to do this, you will need to figure that out yourself! *
HINT: it will involve the following commands: cd <SOMETHING_SOMETHING>
to get you into a directory,
then reconstruct_contig.py .
to run the results
(BONUS) Practice 3.4. Align both the input (in.trimmed.fa
) and output (cogent2.fa
) to the ref genome hg38. Convert the SAM file either to BAM (using samtools) or GFF (using sam_to_collapsed_gff.py
from Cupcake) and visualize it in IGV or UCSC Genome Browser.
Official isoseq3 UMI/single cell documentation. Also refer to the Cupcake single-cell wiki.
conda activate isoseqtoy
cd ~/workshop_data/singlecell
You can follow along the single cell example
conda activate SQANTI3.env
cd ~/workshop_data/SQANTI3
cat run.sh