
Discrepancy in Size of nt Dataset Downloads: Direct Link vs. update_blastdb.pl Command #90

Open
ShuZishan opened this issue Feb 22, 2024 · 2 comments

Comments

@ShuZishan

Why is the nt dataset downloaded from https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ larger (378 GB) than the one downloaded with the command update_blastdb.pl --decompress nt (151 GB)? What accounts for the difference between the two downloads? Could you provide details on the specific data that has been added or removed, and the reasons for these changes? I would greatly appreciate it.

@rse-lbl

rse-lbl commented Mar 11, 2024

You want to go to NCBI for that info since they set up both sets of data: https://www.ncbi.nlm.nih.gov/books/NBK62345/

nt.##.tar.gz The nucleotide sequence database contains entries from traditional divisions of GenBank, EMBL and DDBJ. Sequences from bulk divisions, i.e., gss, sts, pat, est, htg, wgs, con, and environmental sequences are excluded. RefSeq genomic entries are also excluded.

nt.gz The FASTA equivalent of the nt.##.tar.gz database files.

Search the page for "Getting the preformatted database files" for a description of the benefits of the files downloaded through update_blastdb.pl.

But here's the ultimate explanation for the file size discrepancy:

"Preformatted database files remove the makeblastdb formatting steps, and saves valuable processing time and diskspace"

If I understand correctly, the preformatted downloads are stored as optimized binary databases rather than as plain-text FASTA files.
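To illustrate why a binary nucleotide database can be smaller than the same sequences as text, here is a minimal sketch of 2-bit packing, where four unambiguous bases share one byte instead of taking one byte each. The `pack` and `unpack` helpers are hypothetical and written only for this example; this is not NCBI's actual on-disk format, and it ignores ambiguity codes, sequence headers, and index files, so it does not try to reproduce the 378 GB vs. 151 GB figures.

```python
# Illustration only: 2-bit packing of unambiguous nucleotides.
# Not NCBI's real file layout, just the general idea of a compact encoding.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so unpacking stays uniform.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


seq = "ACGT" * 250 + "GA"  # 1002 bases
packed = pack(seq)
assert unpack(packed, len(seq)) == seq  # the packing is lossless

print(len(seq.encode("ascii")), "bytes as plain text")
print(len(packed), "bytes packed")
```

In this toy case the packed form is about a quarter of the text size. Real databases also carry headers and indexes, so the overall ratio between the two downloads would not be expected to be exactly 4:1.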

@qfduli

qfduli commented Nov 26, 2024

Why is the nt database said to require 150+ GB? Is that figure from an older version? The compressed package I downloaded is over 600 GB.
