Discrepancy in Size of nt Dataset Downloads: Direct Link vs. update_blastdb.pl Command #90

ShuZishan · 2024-02-22T19:32:24Z

Why is the nt dataset downloaded from https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ larger (378 GB) than the one downloaded with the command update_blastdb.pl --decompress nt (151 GB)? What accounts for the difference between the two downloads? Could you provide details on what data has been added or removed, and why? I would greatly appreciate it.


rse-lbl · 2024-03-11T20:57:15Z

You want to go to NCBI for that info since they set up both sets of data: https://www.ncbi.nlm.nih.gov/books/NBK62345/

nt.##.tar.gz The nucleotide sequence database contains entries from traditional divisions of GenBank, EMBL and DDBJ. Sequences from bulk divisions, i.e., gss, sts, pat, est, htg, wgs, con, and environmental sequences are excluded. RefSeq genomic entries are also excluded.

nt.gz The FASTA equivalent of the nt.##.tar.gz database files.

Search the page for "Getting the preformatted database files" for a description of the benefits of the files downloaded through update_blastdb.pl.

But here's the ultimate explanation for the file size discrepancy:

Preformatted database files remove the makeblastdb formatting steps, and saves valuable processing time and diskspace

If I understand correctly, the preformatted downloads are stored as compact binary BLAST databases rather than as plain-text FASTA files, which would account for the smaller size on disk.
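To give a rough feel for why a binary encoding can be smaller than FASTA text, here is a minimal Python sketch that packs unambiguous DNA into 2 bits per base. It is an illustration only, not the actual BLAST database layout: the names `BASE_TO_BITS` and `pack_2bit` are made up for this example, and real databases also carry headers, indices, and ambiguity information, so the real-world ratio will not be exactly 4x (the sizes reported above work out to roughly 378 / 151 ≈ 2.5x).

```python
# Illustrative only: this is NOT the actual BLAST database format.
# Hypothetical mapping of the four unambiguous bases to 2-bit codes.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_2bit(seq: str) -> bytes:
    """Pack an unambiguous DNA string into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so each byte holds up to 4 bases.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


seq = "ACGT" * 250                       # 1,000 bases
fasta_bytes = len(seq.encode("ascii"))   # plain text: 1 byte per base
packed_bytes = len(pack_2bit(seq))       # packed: 4 bases per byte
print(fasta_bytes, packed_bytes)         # prints "1000 250"
```

Plain text spends a full byte on each base, while a four-letter alphabet only needs 2 bits, which is where most of the saving in a sketch like this comes from.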

qfduli · 2024-11-26T06:55:12Z

Why is the nt database said to require 150+ GB? Is that figure from an older version? The compressed package I downloaded is over 600 GB.
