3.4 Examining Assemblies from SRA Normalized and Lite Format Files
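For convenience, we have pre-assembled two records from both their Normalized and Lite format files; the resulting assemblies can be retrieved from the workshop GitHub repository using the commands below. An example of the assembly command used is also included. In the file names, "etl" marks assemblies built from Normalized format files and "lite" marks those built from Lite format files.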
SKESA can take a bare accession and retrieve it for you, but here we specify the input files explicitly.
~/SKESA/skesa --reads SRR9854072_1.fastq,SRR9854072_2.fastq --contigs_out SRR9854072_etl_assembly.fasta --use_paired_ends --memory 70 --cores 8
wget https://raw.githubusercontent.com/ncbi/workshop-asm-ngs-2024/refs/heads/main/assemblies/SRR9854072_etl_assembly.fasta
wget https://raw.githubusercontent.com/ncbi/workshop-asm-ngs-2024/refs/heads/main/assemblies/SRR9854072_lite_assembly.fasta
wget https://raw.githubusercontent.com/ncbi/workshop-asm-ngs-2024/refs/heads/main/assemblies/SRR17393369_etl_assembly.fasta
wget https://raw.githubusercontent.com/ncbi/workshop-asm-ngs-2024/refs/heads/main/assemblies/SRR17393369_lite_assembly.fasta
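Before comparing the assemblies, it can help to confirm that each download is a complete FASTA file rather than an error page. The following is a minimal sketch, not part of the workshop material: asm_stats is a hypothetical helper name, and it only counts header lines and sequence characters.

```shell
# Hypothetical helper: print file name, contig count, and total bases
# (tab-separated) for each FASTA file given as an argument.
asm_stats() {
    for fasta in "$@"; do
        awk -v name="$fasta" '
            /^>/ { contigs++; next }     # header line: one per contig
            { bases += length($0) }      # sequence line: add its length
            END { printf "%s\t%d\t%d\n", name, contigs, bases }
        ' "$fasta"
    done
}
```

Calling asm_stats SRR9854072_etl_assembly.fasta SRR9854072_lite_assembly.fasta prints one line per file; similar contig counts and total lengths for the two formats are a first hint that the assemblies agree.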
A straightforward way to check for significant differences between the contigs assembled from Normalized and Lite files is to BLAST them against each other. Here we use the NCBI BLAST Docker image to build a BLAST database from one set of contigs, then search the other set against that database with settings that return only the best match for each query sequence, along with statistics such as the query and subject sequence lengths, the alignment length, and the number of identical matches.
sudo docker run -v $PWD:$PWD:rw -w $PWD --rm ncbi/blast makeblastdb -in $PWD/SRR9854072_etl_assembly.fasta -dbtype nucl -out $PWD/SRR9854072_etl_assembly_blastdb
sudo docker run -v $PWD:$PWD:rw -w $PWD --rm ncbi/blast blastn -db $PWD/SRR9854072_etl_assembly_blastdb -query $PWD/SRR9854072_lite_assembly.fasta -max_target_seqs 1 -outfmt "6 qacc sacc length qlen slen nident" -max_hsps 1 -num_threads 4 -out $PWD/SRR9854072_etl_v_lite_blast_results.tsv
head -n35 SRR9854072_etl_v_lite_blast_results.tsv
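Scanning the first 35 rows by eye works for a quick look, but the same check can be applied to every row. The sketch below assumes the six-column layout requested with -outfmt above (qacc, sacc, length, qlen, slen, nident); blast_diff is a hypothetical helper name, and treating a hit as "identical" only when the alignment spans both contigs with no mismatches is our own illustrative criterion.

```shell
# Hypothetical helper: report best hits that are NOT full-length, identical
# matches, then print a one-line tally.
# Columns: 1 qacc, 2 sacc, 3 length, 4 qlen, 5 slen, 6 nident
blast_diff() {
    awk -F'\t' '
        # alignment covers the whole query and subject, every position identical
        $3 == $4 && $4 == $5 && $6 == $3 { same++; next }
        # anything else: show percent identity and query coverage
        { printf "%s\t%s\t%.2f%% identity over %d of %d query bases\n",
                 $1, $2, 100 * $6 / $3, $3, $4 }
        END { printf "%d of %d best hits are full-length and identical\n", same, NR }
    ' "$1"
}
```

Running blast_diff SRR9854072_etl_v_lite_blast_results.tsv lists any contig pairs that differ. Query contigs with no hit at all do not appear in tabular BLAST output, so it is worth comparing the number of rows against the number of contigs in the query assembly.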
Because SKESA does not use quality scores, no differences between assemblies from Normalized and Lite files would be expected. We also did not apply a read-filtering step here, so the identity between contigs assembled from the two formats is unsurprising. We can use a tool like FastQC to assess whether any reads should have been filtered out, and whether the same reads would have been filtered from both file formats.
fastqc *.fastq
The resulting HTML files can then be downloaded from the VM console window and viewed in your browser.
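The FastQC results can also be checked on the command line. As a rough sketch, assuming FastQC was run with its --extract option so that each report is unpacked into a directory containing a tab-separated summary.txt (status, module, input file), the hypothetical helper fastqc_flags below lists every module that did not pass.

```shell
# Hypothetical helper: list FastQC modules whose status is WARN or FAIL.
# Input: one or more summary.txt files from extracted FastQC reports.
# summary.txt columns: 1 status (PASS/WARN/FAIL), 2 module name, 3 input file
fastqc_flags() {
    awk -F'\t' '$1 != "PASS" { printf "%s\t%s\t%s\n", $3, $1, $2 }' "$@"
}
```

For example, after running fastqc --extract *.fastq, the call fastqc_flags ./*_fastqc/summary.txt prints one line per flagged module. If a quality-related module is flagged for reads from one format but not the other, a quality-based filter would likely treat the two inputs differently.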
This work was supported by the National Center for Biotechnology Information of the National Library of Medicine (NLM) and the National Institute of Allergy and Infectious Diseases (NIAID), National Institutes of Health.