How to run RPS-BLAST+ and cdd2cog
#1
Hi Alexandre, Cog is the preformatted RPS-BLAST+ database that can be found here: ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/little_endian/Cog_LE.tar.gz You're right, that could be clearer in the README. You need to download all the files listed in the description and provide the paths to them with the respective options (or just put them into your working directory and use the examples I provide in Usage). The query (protein.fasta) is a multi-FASTA protein file, i.e. the proteins you want to assign COGs to. You can get such a file e.g. with my Best, |
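A minimal sketch of the database setup described above. The download URL is the one given in the reply and may have moved since; the helper name `unpack_cog_db` and the directory name `cog_db` are hypothetical. The database base name `Cog` is taken from the commands later in this thread.

```shell
# Download step (shown as a comment, not executed here; URL from the reply above):
#   curl -O ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/little_endian/Cog_LE.tar.gz

# Hypothetical helper: unpack a database archive into its own directory.
unpack_cog_db() {
    archive=$1
    dest=$2
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}

# Usage, assuming the archive sits in the current directory:
#   unpack_cog_db Cog_LE.tar.gz cog_db
# The -db option then points at the base name of the unpacked files,
# not at the directory itself:
#   rpsblast -query protein.fasta -db cog_db/Cog -out rps-blast.out -evalue 0.01 -outfmt 6
```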
From Alexandre: Thank you for your answer. I ran the RPS-BLAST command rpsblast -i c_prot.faa -d /home/lgmmicrorganismo/Programas_Analise/cdd2cog/banco_de_dados/ -o rps-blast.out -evalue 1e-2 -outfmt 6 and got this message: [rpsblast] ERROR: Expectation value (E) [value] is bad or out of range [? to ?] What e-value should I use? I am posting this here only because it seems to be the only remaining issue preventing a correct run. |
try 0.01: rpsblast -query c_prot.faa -db /home/lgmmicrorganismo/Programas_Analise/cdd2cog/banco_de_dados/Cog -out rps-blast.out -evalue 0.01 -outfmt 6 HTH, |
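Note that the corrected command differs from the failing one in more than the e-value: it uses the BLAST+ option names (`-query`, `-db`, `-out`) and a `-db` path that ends in the database base name `Cog` rather than in the directory. The sketch below only prints such a command so it can be inspected before running; `build_rpsblast_cmd` is a hypothetical helper, not part of BLAST+ or cdd2cog.

```shell
# Hypothetical helper: print (do not run) a BLAST+ style rpsblast command.
# Arguments: query FASTA, database base name, output file, optional e-value.
build_rpsblast_cmd() {
    query=$1
    db=$2
    out=$3
    evalue=${4:-0.01}   # default taken from the reply above
    printf 'rpsblast -query %s -db %s -out %s -evalue %s -outfmt 6\n' \
        "$query" "$db" "$out" "$evalue"
}

# Example: the database path includes the base name "Cog".
build_rpsblast_cmd c_prot.faa /path/to/banco_de_dados/Cog rps-blast.out
```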
From Alexandre: Ok, just one last question. After running cdd2cog on my proteins I got this: Overall assignment statistics: I think too few proteins (739 of 3190) had functional categories assigned. Can this be improved? |
Hi Alexandre, sorry for replying so late. Unfortunately, rpsblast is not the most sensitive search. However, it's quite odd that you have so few functional categories. Normally the number of assigned functional categories should be higher than the number of proteins categorized into COGs, as many COGs are associated with several functional categories. What kind of dataset are you using as queries? They must be proteins from very similar functional categories. This is what I get if I run the pipeline with all CDS from an E. coli genome (4750 CDS): Overall assignment statistics: You can use option -a of |
From Alexandre: Thank you for your answer. Regarding the dataset used as query, the proteins are from a bacterium. They were extracted from the contigs using Prodigal. |
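Before looking at the COG statistics, it can help to check how many query proteins received any hit at all. The following is a rough sketch, assuming tabular RPS-BLAST+ output (`-outfmt 6` or `7`) with the query ID in the first column; both function names are hypothetical.

```shell
# Count distinct query IDs that have at least one hit.
# Lines starting with "#" (comment lines of -outfmt 7) are skipped.
count_queries_with_hits() {
    grep -v '^#' "$1" | cut -f1 | sort -u | wc -l | tr -d ' '
}

# Count the protein records in a multi-FASTA file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Usage:
#   count_queries_with_hits rps-blast.out
#   count_fasta_records c_prot.faa
```

If only a small fraction of the proteins has a hit, the low number of assignments comes from the search step rather than from the conversion to COGs.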
From Alexandre: In the command line rpsblast+ -db <database> -query <query_sequence> -out <result.out> -evalue 1e-2 -outfmt '7 qseqid sseqid pident length mismatch gapopen qend sstart send bitscore qcovs' what are the default values for the highlighted parameters? |
Hi Alexandre, you can run either: or: The strings 'qseqid' etc. are a custom format for the output, this way you can include 'qcovs' to have the 'Query Coverage Per Subject'. See But these are both only examples given in the |
Hello, I found this discussion very useful, thanks. I would like to know what you expect from those proteins that could not be assigned to a COG. In my case I have several bacterial genomes, and in all cases at least 1000 genes are not assigned. Are there no conserved domains for them? Why can it be assigned as a hypothetical protein or function unknown? |
Hi @luciagrami, the initial COG release was very strict in including orthologs, especially regarding good annotation. Thus, the coverage of bacterial proteins is not very high, especially for bacteria that didn't have closed genomes at that time. COGs can only be assigned via CDD hits, i.e. where a COG is actually associated. There is a COG functional category "function unknown" ([S]), but of course it is associated with certain COGs. Thus, for all your proteins without a COG classification or even without a CDD hit, you can of course set it to "function unknown" yourself. There's a new COG release from 2014 which has higher coverage of bacterial genomes, but I haven't integrated it yet (see #2). |
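Labelling the proteins without any hit as "function unknown" could be sketched as follows. This is only an illustration, assuming tabular RPS-BLAST+ output with the query ID in the first column and assuming that the query ID equals the first word of each FASTA header; `list_unassigned` is a hypothetical helper, and the category letter S is the one mentioned above.

```shell
# Print every protein ID from the FASTA file that has no hit in the
# tabular RPS-BLAST+ output, followed by a tab and the category "S".
list_unassigned() {
    fasta=$1
    hits=$2
    all_ids=$(mktemp)
    hit_ids=$(mktemp)
    # First word of every FASTA header, without the leading ">".
    grep '^>' "$fasta" | sed 's/^>//' | awk '{ print $1 }' | sort -u > "$all_ids"
    # Distinct query IDs with at least one hit (comment lines skipped).
    grep -v '^#' "$hits" | cut -f1 | sort -u > "$hit_ids"
    # IDs present in the FASTA file but absent from the hit list.
    comm -23 "$all_ids" "$hit_ids" | awk '{ print $0 "\tS" }'
    rm -f "$all_ids" "$hit_ids"
}

# Usage:
#   list_unassigned protein.fasta rps-blast.out > unassigned.tsv
```

Proteins that do have a CDD hit but end up without a COG in the cdd2cog results are not covered by this sketch; they would have to be taken from the cdd2cog output instead.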
Hi Andreas, I don't get any classification. I annotated a bacterial genome using PROKKA and tried two ways to get the rps-blast.out: I used the .faa file (6,696 CDS) generated by PROKKA, and I used the cds_extractor.pl script to get a file from the .gbk. The command was: rpsblast -i PROKKA_02152017.faa -d Cog -o rps-blast.out -e 0.01 -m 6 Do you have any idea what's happening? Thanks in advance!! Graciela |
Hi @gracielad, Second, you're using the legacy
Please check your version with
For a working example (positive control), here are the commands for a complete run with E. coli K-12 MG1655:
And here is the corresponding output with MG1655: Best, |
Hi Andreas, It works!! Thanks!!! All the best!! |
Following this example rpsblast claims the arguments are wrong, and it works after I adjust the command to
which gets the job done, however the conversion to COGs with
gets the result
which brings no information, since nothing was identified. It also outputs a lot of "Use of uninitialized value" which probably means the CDD's IDs are not being recognized. The rest of the commands were used as suggested. |
I have exactly the same issue as @iquasere. Please let me know how to deal with it if you solved the problem. |
@utkinaira I did manage to find the answer here. Turns out the IDs of CDD changed format, changing the output of rps-blast |
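The linked answer is not reproduced here, so the following is only a sketch of one possible workaround. It assumes that newer RPS-BLAST+ output reports subject IDs as `CDD:<number>` while the script expects the older `gnl|CDD|<number>` form; both formats and the helper name `normalize_cdd_ids` are assumptions, so check the second column of your own rps-blast.out first.

```shell
# Rewrite the subject ID (second column) of tabular RPS-BLAST+ output
# from the assumed new form "CDD:<number>" to the assumed old form
# "gnl|CDD|<number>". Lines with fewer than two columns pass through unchanged.
normalize_cdd_ids() {
    awk 'BEGIN { FS = OFS = "\t" }
         NF >= 2 { sub(/^CDD:/, "gnl|CDD|", $2) }
         { print }' "$1"
}

# Usage:
#   normalize_cdd_ids rps-blast.out > rps-blast.fixed.out
```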
@iquasere Thank you so much, I'll try it! |
A master's student from Brazil contacted me via email with questions about how to run RPS-BLAST+ and cdd2cog.pl correctly. I'm copying the correspondence here in case it is useful for someone else:
Hi, I am a master's student of genetics at the Universidade Federal de Minas Gerais, Brazil. I was reading the cdd2cog description at
https://github.com/aleimba/bac-genomics-scripts/tree/master/cdd2cog#rps-blast
In the line referring to the use of RPS-BLAST:
rpsblast -query protein.fasta -db Cog -out rps-blast.out -evalue 1e-2 -outfmt 6
I am confused about Cog, which I highlighted. Is this another database we must download? If so, where can I find it? And to search the protein sequences of a draft bacterial genome I assembled, which database should I get? Thank you in advance.