Can not download the whole reference genomes. #32

Lily-WL · 2020-07-07T01:36:25Z

Dear Developers,

When I download the reference genomes with the command "phylophlan_get_reference -g all -o input_genomes/ -n 1 --verbose 2>&1 | tee logs/phylophlan_get_reference.log", the download usually stops before it has finished.

Downloading file of size: 1.16 MB
1.16 MB 100.06 % 0.48 MB/sec 0 min -0 sec
Downloading 1 reference genomes for k__Bacteria|p__Candidatus_Veblenbacteria|c__Candidatus_Veblenbacteria_unclassified|o__Candidatus_Veblenbacteria_unclassified|f__Candidatus_Veblenbacteria_unclassified|g__Candidatus_Veblenbacteria_unclassified|s__Candidatus_Veblenbacteria_bacterium_RIFOXYC2_FULL_42_11
Downloading "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/822/985/GCA_001822985.1_ASM182298v1/GCA_001822985.1_ASM182298v1_genomic.fna.gz" to "input_genomes/GCA_001822985.fna.gz"
Downloading file of size: 0.23 MB
0.23 MB 100.10 % 0.17 MB/sec 0 min -0 sec
Downloading 1 reference genomes for k__Bacteria|p__Candidatus_Veblenbacteria|c__Candidatus_Veblenbacteria_unclassified|o__Candidatus_Veblenbacteria_unclassified|f__Candidatus_Veblenbacteria_unclassified|g__Candidatus_Veblenbacteria_unclassified|s__Candidatus_Veblenbacteria_bacterium_RIFOXYD1_FULL_43_11
Downloading "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/823/015/GCA_001823015.1_ASM182301v1/GCA_001823015.1_ASM182301v1_genomic.fna.gz" to "input_genomes/GCA_001823015.fna.gz"
Downloading file of size: 0.19 MB
0.05 MB 24.20 % 0.04 MB/sec 0 min 4 sec

I do not know whether this is because the connection to NCBI dropped or for some other reason. What can I do about it?

fasnicar · 2020-07-07T09:49:48Z

Hi, I think it might be due to some connection instability.
This morning I re-ran the command you posted, and it is still running (downloading genomes from NCBI).
Are you able to try with a different Internet connection?

Thanks,
Francesco

Lily-WL · 2020-07-08T01:05:19Z

Dear Francesco,

Thank you very much for your reply! I have tried many times and the result is always similar. So I am wondering whether I can download the remaining genomes one by one. Is it possible to get the list of reference genomes? Are all the genomes under "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA" needed?

fasnicar · 2020-07-08T12:14:24Z

Yes, you can play a bit with bash and the files downloaded by PhyloPhlAn at the beginning: taxa2genomes_cpa0.2_up201804.txt.bz2 and assembly_summary_genbank.txt.

For each line in taxa2genomes_cpa0.2_up201804.txt.bz2, take the first item of the (;-separated) list in the third (TAB-separated) field.
That ID has the form GCA_001905625.1; split it on the "." and keep only the first part (i.e., GCA_001905625).
Then look up the ftp_path in assembly_summary_genbank.txt for the entry that matches this ID; it is the base of the download URL.
In that ftp_path, replace ftp:// with https:// and append a / followed by the last path component plus _genomic.fna.gz, which gives URLs like the ones printed in the log above.
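
The steps above can be sketched in Python rather than bash. This is only a minimal sketch under stated assumptions, not PhyloPhlAn's own code: the function names are hypothetical, the ftp_path column is located by its name in the "#" header line of assembly_summary_genbank.txt (falling back to the 20th column), and lines starting with "#" in the taxa2genomes file are treated as comments. The final URL repeats the last path component before _genomic.fna.gz, matching the URLs printed in the log earlier in this thread.

```python
import bz2
import os


def first_genome_ids(taxa2genomes_bz2):
    """Yield the unversioned ID of the first genome listed on each line."""
    with bz2.open(taxa2genomes_bz2, "rt") as handle:
        for line in handle:
            if line.startswith("#"):  # assumption: '#' marks comment lines
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3 or not fields[2]:
                continue
            first = fields[2].split(";")[0]   # first item of the ;-separated list
            yield first.split(".")[0]         # GCA_001905625.1 -> GCA_001905625


def load_ftp_paths(assembly_summary):
    """Map unversioned accession -> ftp_path from the assembly summary."""
    paths = {}
    ftp_col = 19  # assumption: used only if no header names the column
    with open(assembly_summary, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#"):
                names = [f.lstrip("# ").strip() for f in fields]
                if "ftp_path" in names:
                    ftp_col = names.index("ftp_path")
                continue
            if len(fields) > ftp_col and fields[ftp_col].startswith(("ftp://", "https://")):
                paths[fields[0].split(".")[0]] = fields[ftp_col]
    return paths


def genome_url(ftp_path):
    """Turn an ftp_path into the https URL of the *_genomic.fna.gz file."""
    base = ftp_path.rstrip("/")
    if base.startswith("ftp://"):
        base = "https://" + base[len("ftp://"):]
    return base + "/" + base.rsplit("/", 1)[1] + "_genomic.fna.gz"


def missing_downloads(taxa2genomes_bz2, assembly_summary, out_dir):
    """Yield (url, destination) for genomes not yet present in out_dir."""
    paths = load_ftp_paths(assembly_summary)
    for genome_id in first_genome_ids(taxa2genomes_bz2):
        dest = os.path.join(out_dir, genome_id + ".fna.gz")
        if genome_id in paths and not os.path.exists(dest):
            yield genome_url(paths[genome_id]), dest
```

Calling missing_downloads("taxa2genomes_cpa0.2_up201804.txt.bz2", "assembly_summary_genbank.txt", "input_genomes/") and printing each pair gives a list that can be handed to any downloader. The existence check cannot tell a complete file from an interrupted one, so a partially downloaded file (such as the one that stopped at 24.20 % above) would need to be deleted first.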

Lily-WL · 2020-07-09T08:35:19Z

Thank you very much for your reply. To download the large number of remaining genomes, can I edit the file "taxa2genomes_cpa0.2_up201804.txt.bz2" and cut out the entries for the genomes that were already downloaded? I tried that, but it does not work:

(python3.7) [wl@ts-rd350 Phylophlan]$ phylophlan_get_reference -g all -o input_genomes/ -n 1 --verbose 2>&1 | tee logs/phylophlan_get_reference.log
phylophlan_get_reference.py version 3.0.16 (8 May 2020)

Command line: /home/wl/.conda/envs/python3.7/bin/phylophlan_get_reference -g all -o input_genomes/ -n 1 --verbose

Arguments: {'get': 'all', 'list_clades': False, 'database_update': False, 'output_file_extension': '.fna.gz', 'output': 'input_genomes/', 'how_many': 1, 'genbank_mapping': 'assembly_summary_genbank.txt', 'verbose': True}
File "taxa2genomes.txt" present
File "taxa2genomes_cpa0.2_up201804.txt.bz2" present
Output folder "input_genomes/" present
File "assembly_summary_genbank.txt" present
Traceback (most recent call last):
  File "/home/wl/.conda/envs/python3.7/bin/phylophlan_get_reference", line 10, in <module>
    sys.exit(phylophlan_get_reference())
  File "/home/wl/.conda/envs/python3.7/lib/python3.7/site-packages/phylophlan/phylophlan_get_reference.py", line 313, in phylophlan_get_reference
    args.output_file_extension, args.output, args.database_update, verbose=args.verbose)
  File "/home/wl/.conda/envs/python3.7/lib/python3.7/site-packages/phylophlan/phylophlan_get_reference.py", line 274, in get_reference_genomes
    if (taxa_label in r_clean[1].split('|')) or (taxa_label == 'all'):
IndexError: list index out of range
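
An IndexError at r_clean[1] suggests that some line of the edited file no longer splits into at least two fields. One possible cause, offered only as a guess, is a blank line or a line whose TABs were turned into spaces during editing. The sketch below lists such lines; it assumes the fields are TAB-separated as described earlier in the thread and that lines starting with "#" are comments, and the function name is hypothetical. It also assumes the edited file was compressed with bzip2 again, since bz2.open raises an error on plain text.

```python
import bz2


def malformed_lines(taxa2genomes_bz2, min_fields=3):
    """Return (line_number, raw_line) for lines with too few TAB-separated fields."""
    bad = []
    with bz2.open(taxa2genomes_bz2, "rt") as handle:
        for number, line in enumerate(handle, start=1):
            if line.startswith("#"):  # assumption: '#' marks comment lines
                continue
            if len(line.strip().split("\t")) < min_fields:
                bad.append((number, line.rstrip("\n")))
    return bad
```

An empty result for the edited taxa2genomes_cpa0.2_up201804.txt.bz2 would point to a different cause than malformed lines.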

fasnicar self-assigned this Jul 7, 2020
