Can not download the whole reference genomes. #32

Open
Lily-WL opened this issue Jul 7, 2020 · 4 comments

Lily-WL commented Jul 7, 2020

Dear Developers,

When I download the reference genomes using the command "phylophlan_get_reference -g all -o input_genomes/ -n 1 --verbose 2>&1 | tee logs/phylophlan_get_reference.log", the download usually stops before it has finished.

Downloading file of size: 1.16 MB
1.16 MB 100.06 % 0.48 MB/sec 0 min -0 sec
Downloading 1 reference genomes for k__Bacteria|p__Candidatus_Veblenbacteria|c__Candidatus_Veblenbacteria_unclassified|o__Candidatus_Veblenbacteria_unclassified|f__Candidatus_Veblenbacteria_unclassified|g__Candidatus_Veblenbacteria_unclassified|s__Candidatus_Veblenbacteria_bacterium_RIFOXYC2_FULL_42_11
Downloading "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/822/985/GCA_001822985.1_ASM182298v1/GCA_001822985.1_ASM182298v1_genomic.fna.gz" to "input_genomes/GCA_001822985.fna.gz"
Downloading file of size: 0.23 MB
0.23 MB 100.10 % 0.17 MB/sec 0 min -0 sec
Downloading 1 reference genomes for k__Bacteria|p__Candidatus_Veblenbacteria|c__Candidatus_Veblenbacteria_unclassified|o__Candidatus_Veblenbacteria_unclassified|f__Candidatus_Veblenbacteria_unclassified|g__Candidatus_Veblenbacteria_unclassified|s__Candidatus_Veblenbacteria_bacterium_RIFOXYD1_FULL_43_11
Downloading "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/823/015/GCA_001823015.1_ASM182301v1/GCA_001823015.1_ASM182301v1_genomic.fna.gz" to "input_genomes/GCA_001823015.fna.gz"
Downloading file of size: 0.19 MB
0.05 MB 24.20 % 0.04 MB/sec 0 min 4 sec

I do not know whether this is because the connection to NCBI dropped or for some other reason. What can I do about it?

@fasnicar fasnicar self-assigned this Jul 7, 2020
Collaborator

fasnicar commented Jul 7, 2020

Hi, I think it might be due to some connection instability.
I re-ran the command you posted this morning and it is still running (downloading genomes from NCBI).
Are you able to try with a different Internet connection?

Thanks,
Francesco

Author

Lily-WL commented Jul 8, 2020

Dear Francesco,

Thank you very much for your reply! I have tried many times and the result is similar each time. So I am wondering whether I can download the remaining genomes one by one. Is it possible to get the list of reference genomes? Are all the genomes under "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA" needed?

Collaborator

fasnicar commented Jul 8, 2020

Yes, you can play a bit with bash and the files downloaded by PhyloPhlAn at the beginning: taxa2genomes_cpa0.2_up201804.txt.bz2 and assembly_summary_genbank.txt.

  1. For each line in taxa2genomes_cpa0.2_up201804.txt.bz2, take the third field (TAB-separated) and, within it, the first item of the ;-separated list.
  2. The ID from the previous step has the form GCA_001905625.1; split it on the . and keep only the first part (i.e., GCA_001905625).
  3. Look up that ID in assembly_summary_genbank.txt and take the matching ftp_path, which gives the URL for downloading.
  4. In the URL retrieved from assembly_summary_genbank.txt, replace ftp:// with https:// and append _genomic.fna.gz to the end.
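
The four steps above can also be sketched in Python instead of bash. This is a minimal sketch, not PhyloPhlAn's own code: the function names are made up for illustration, and it assumes that comment lines in both files start with #, that assembly_summary_genbank.txt has a #-prefixed header line naming an ftp_path column, and that a missing path is written as na. Judging from the URLs in the log above, the final URL repeats the last directory name as the file prefix before _genomic.fna.gz, so the sketch builds it that way.

```python
import bz2


def first_accessions(taxa2genomes_path):
    """Steps 1-2: yield the first genome ID of each line, without its version."""
    with bz2.open(taxa2genomes_path, "rt") as handle:
        for line in handle:
            # Assumption: lines starting with '#' are comments.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) < 3:
                continue  # not enough TAB-separated fields, skip
            first_id = fields[2].split(";")[0]   # e.g. GCA_001905625.1
            yield first_id.split(".")[0]         # e.g. GCA_001905625


def load_ftp_paths(summary_path):
    """Step 3: map accession (without version) to its ftp_path."""
    paths = {}
    ftp_col = None
    with open(summary_path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\r\n").split("\t")
            if line.startswith("#"):
                # Assumption: the header line is a comment naming the columns.
                if "ftp_path" in fields:
                    ftp_col = fields.index("ftp_path")
                continue
            if ftp_col is None:
                raise ValueError("no header line with an ftp_path column found")
            if len(fields) > ftp_col:
                paths[fields[0].split(".")[0]] = fields[ftp_col]
    return paths


def build_url(ftp_path):
    """Step 4: switch to https and point at the *_genomic.fna.gz file."""
    base = ftp_path.rstrip("/")
    if base.startswith("ftp://"):
        base = "https://" + base[len("ftp://"):]
    # The log shows <dir>/<dir>_genomic.fna.gz, so repeat the directory name.
    return f"{base}/{base.rsplit('/', 1)[-1]}_genomic.fna.gz"


def genome_urls(taxa2genomes_path, summary_path):
    """Yield (accession, download URL) pairs for every resolvable genome."""
    ftp_paths = load_ftp_paths(summary_path)
    for accession in first_accessions(taxa2genomes_path):
        ftp_path = ftp_paths.get(accession)
        if ftp_path and ftp_path != "na":  # assumption: 'na' marks no path
            yield accession, build_url(ftp_path)
```

Iterating over genome_urls("taxa2genomes_cpa0.2_up201804.txt.bz2", "assembly_summary_genbank.txt") and printing each pair gives a list of URLs; genomes already present in input_genomes/ can be skipped, and the rest fetched one by one with any download tool.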

Author

Lily-WL commented Jul 9, 2020

Thank you very much for your reply. To download the large number of remaining genomes, can I edit the file "taxa2genomes_cpa0.2_up201804.txt.bz2" and cut out the entries for the genomes that have already been downloaded? I tried this, but it does not work:

(python3.7) [wl@ts-rd350 Phylophlan]$ phylophlan_get_reference -g all -o input_genomes/ -n 1 --verbose 2>&1 | tee logs/phylophlan_get_reference.log
phylophlan_get_reference.py version 3.0.16 (8 May 2020)

Command line: /home/wl/.conda/envs/python3.7/bin/phylophlan_get_reference -g all -o input_genomes/ -n 1 --verbose

Arguments: {'get': 'all', 'list_clades': False, 'database_update': False, 'output_file_extension': '.fna.gz', 'output': 'input_genomes/', 'how_many': 1, 'genbank_mapping': 'assembly_summary_genbank.txt', 'verbose': True}
File "taxa2genomes.txt" present
File "taxa2genomes_cpa0.2_up201804.txt.bz2" present
Output folder "input_genomes/" present
File "assembly_summary_genbank.txt" present
Traceback (most recent call last):
  File "/home/wl/.conda/envs/python3.7/bin/phylophlan_get_reference", line 10, in <module>
    sys.exit(phylophlan_get_reference())
  File "/home/wl/.conda/envs/python3.7/lib/python3.7/site-packages/phylophlan/phylophlan_get_reference.py", line 313, in phylophlan_get_reference
    args.output_file_extension, args.output, args.database_update, verbose=args.verbose)
  File "/home/wl/.conda/envs/python3.7/lib/python3.7/site-packages/phylophlan/phylophlan_get_reference.py", line 274, in get_reference_genomes
    if (taxa_label in r_clean[1].split('|')) or (taxa_label == 'all'):
IndexError: list index out of range
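
The failing expression in the traceback is r_clean[1], so one plausible cause (not confirmed in this thread) is that the edited file contains a line that no longer splits into enough TAB-separated fields, for example a blank line or a line where an editor replaced tabs with spaces. The sketch below lists such lines. The function name is hypothetical, the skipping of # lines is an assumption about the file format, and the minimum of three fields follows the description of the file given earlier in this thread.

```python
import bz2


def malformed_lines(taxa2genomes_path, min_fields=3):
    """Return (line number, text) for lines with too few TAB-separated fields."""
    bad = []
    with bz2.open(taxa2genomes_path, "rt") as handle:
        for number, line in enumerate(handle, start=1):
            # Assumption: lines starting with '#' are comments and are fine.
            if line.startswith("#"):
                continue
            fields = line.strip().split("\t")
            if len(fields) < min_fields:
                bad.append((number, line.rstrip("\r\n")))
    return bad
```

An empty result would suggest that the problem lies elsewhere; if bz2.open itself fails, the edited file was probably not recompressed as bzip2.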
