3.4 Examining Assemblies from SRA Normalized and Lite Format Files

Project 2: Compare assemblies of SRA Normalized and SRA Lite records

Step 1: Assemble Records using SKESA

For convenience, we have pre-assembled two records in both Normalized and Lite formats. The resulting assemblies can be retrieved from the workshop GitHub repository using the commands below. We also include an example of the assembly command used.

Example assembly command

SKESA can take a bare accession and retrieve it for you, but here we specify the input files explicitly.

~/SKESA/skesa --reads SRR9854072_1.fastq,SRR9854072_2.fastq --contigs_out SRR9854072_etl_assembly.fasta --use_paired_ends --memory 70 --cores 8

Retrieve the assemblies from GitHub

wget https://raw.githubusercontent.com/ncbi/workshop-asm-ngs-2024/refs/heads/main/assemblies/SRR9854072_etl_assembly.fasta
wget https://raw.githubusercontent.com/ncbi/workshop-asm-ngs-2024/refs/heads/main/assemblies/SRR9854072_lite_assembly.fasta
wget https://raw.githubusercontent.com/ncbi/workshop-asm-ngs-2024/refs/heads/main/assemblies/SRR17393369_etl_assembly.fasta
wget https://raw.githubusercontent.com/ncbi/workshop-asm-ngs-2024/refs/heads/main/assemblies/SRR17393369_lite_assembly.fasta
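Before comparing the assemblies, it can be useful to confirm that each download is a complete FASTA file and to see how many contigs it holds. The following is a minimal sketch, not part of the workshop materials: `assembly_stats` is a hypothetical helper name, and it assumes plain (uncompressed) FASTA input.

```shell
# Hypothetical helper: print "<file> <contig count> <total bases>" (tab-separated)
# for each FASTA file given on the command line.
assembly_stats() {
    for fasta in "$@"; do
        awk -v name="$fasta" '
            /^>/ { contigs++; next }      # header line: one per contig
            { bases += length($0) }       # sequence line: add its length
            END { printf "%s\t%d\t%d\n", name, contigs, bases }
        ' "$fasta"
    done
}
```

For example, `assembly_stats SRR*_assembly.fasta` would print one line per assembly, which makes it easy to see at a glance whether the Normalized and Lite assemblies of a record have similar contig counts and total lengths.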

Step 2: Compare Results for Lite and Normalized File Formats

A straightforward way to check for significant differences between the contigs assembled from Normalized and Lite files is to BLAST them against each other. To do this, we use the NCBI BLAST Docker image to create a BLAST database from one set of contigs, then search the other set against that database. The settings below return the best match for each query sequence, along with the query and subject sequence lengths, the alignment length, and the number of identical matches.

sudo docker run -v $PWD:$PWD:rw -w $PWD --rm ncbi/blast makeblastdb -in $PWD/SRR9854072_etl_assembly.fasta -dbtype nucl -out $PWD/SRR9854072_etl_assembly_blastdb

sudo docker run -v $PWD:$PWD:rw -w $PWD --rm ncbi/blast blastn -db $PWD/SRR9854072_etl_assembly_blastdb -query $PWD/SRR9854072_lite_assembly.fasta -max_target_seqs 1 -outfmt "6 qacc sacc length qlen slen nident" -max_hsps 1 -num_threads 4 -out $PWD/SRR9854072_etl_v_lite_blast_results.tsv

head -n35 SRR9854072_etl_v_lite_blast_results.tsv
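Scanning the table by eye works for a few dozen contigs, but a short summary is easier to read. The sketch below assumes the column order requested above (`qacc sacc length qlen slen nident`); `summarize_blast` is a hypothetical helper name. It treats a contig as identical when the alignment spans the full query and subject and every aligned position matches, and prints a `DIFF` line for anything else.

```shell
# Hypothetical helper: summarize a BLAST table produced with
# -outfmt "6 qacc sacc length qlen slen nident".
summarize_blast() {
    awk -F '\t' '
        {
            total++
            # Columns: $1 qacc, $2 sacc, $3 length, $4 qlen, $5 slen, $6 nident
            if ($3 == $4 && $4 == $5 && $6 == $3) {
                identical++
            } else {
                printf "DIFF\t%s\t%s\tidentity=%.2f%%\tquery_cov=%.2f%%\n", $1, $2, 100 * $6 / $3, 100 * $3 / $4
            }
        }
        END { printf "%d of %d contigs matched identically\n", identical, total }
    ' "$1"
}
```

It could be run as `summarize_blast SRR9854072_etl_v_lite_blast_results.tsv`. Two caveats apply: query contigs with no hit do not appear in tabular output, so the total counts only contigs that had a match, and the alignment length includes gaps, so the coverage figure is approximate.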

Step 3: Compare Quality Score Distributions

Because SKESA does not make use of quality scores, we would not expect differences between assemblies built from Normalized and Lite files. We also did not apply a read-filtering step here, so the identity between contigs assembled from the two formats is unsurprising. We can use a tool like FastQC to assess whether any reads should have been filtered out, and whether the same reads would have been filtered from both file formats.

fastqc *.fastq

The resulting HTML files can then be downloaded from the VM console window and viewed in your browser.

[Figure: example FastQC HTML report]
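For a quick command-line look at the same question, the quality characters in each FASTQ file can be tallied directly. The sketch below is an illustration rather than a replacement for FastQC: `quality_char_counts` is a hypothetical helper name, and it assumes four-line FASTQ records with no wrapped sequence or quality lines. SRA Lite files are documented as carrying simplified per-read quality values, so a Lite FASTQ would be expected to show very few distinct characters compared with its Normalized counterpart.

```shell
# Hypothetical helper: tally how often each quality character occurs in a
# FASTQ file. Assumes four-line records (no wrapped lines).
quality_char_counts() {
    awk '
        NR % 4 == 0 {                          # every 4th line is a quality string
            for (i = 1; i <= length($0); i++) {
                counts[substr($0, i, 1)]++
            }
        }
        END {
            for (q in counts) printf "%s\t%d\n", q, counts[q]
        }
    ' "$1" | LC_ALL=C sort                     # byte order, so output is stable
}
```

Running `quality_char_counts SRR9854072_1.fastq` on the Normalized file and then on the corresponding Lite file gives two short tables that can be compared side by side.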

This work was supported by the National Center for Biotechnology Information of the National Library of Medicine (NLM) and the National Institute of Allergy and Infectious Diseases (NIAID), National Institutes of Health.
