Some protein sequences in your file are identical. #49

Closed
alexvasilikop opened this issue Jul 27, 2023 · 6 comments

@alexvasilikop

Hello Darrin,

I am getting the error in the title, followed by a crash, since cloning the most recent odp version; I was not getting this error before. Some of my genomes have a known ancient whole-genome duplication event, so it could indeed be that many proteins on different chromosomes (homoeologs) have identical sequences. I am sceptical as to whether it is really necessary to introduce this legality check.

I also downloaded some publicly available genomes and had the same problem. The identical sequences are not different isoforms, as they come from different loci (different gene and locus IDs). I made sure to remove identical or alternative isoforms before running odp.

Any thoughts on this?
Thanks
Alex

@alexvasilikop
Author

alexvasilikop commented Jul 27, 2023

Short update: I checked three of the potential duplicate sequences listed in the error message:

>FUN_002930-T1
MSGRGKGGKGLGKGGAKRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEEVRGVLK
IFLENVIRDAVTYTEHAKRKTVTAMDVVYALKRQGRTLYGFGS
>FUN_004808-T1
MLGQPALLNTWNGDYHTIGYGNYGYGMYTQDSGLYLNPWNGRVGVEQEVDFTPLDAYHFQ
PYTGIPATIRQHYIS
>FUN_004818-T1
MYKTLTLILLITVGVFTTINAATIERADIETGELELSEEQYQGLAQCDQNVLEVELKDKH
QYRALKLIPYDIIIKPPTDCWWRYRLFMALNYQKYKDWANKYCRSYIGCWCCPGGGLCVT
FIVKPDFWRCFIIDLPIFKEKIPRIPVFEPELDIIRQ

Apparently they are not identical. Any reasons for the error then?

cheers

@conchoecia
Owner

Hi Alex,

The point of this error message is to make people think about their input data and why there might be identical proteins in it. The most common cause is an annotation derived from de novo transcriptomes that hasn't been de-duplicated. In these cases it is best to investigate whether the duplicates are real, as misannotated proteins will detract from detecting the true signal of chromosome evolution.

In your case there is a flag in the config.yaml file to shut off the warnings: duplicate_proteins: "fail" # currently only "fail" or "best". "fail" doesn't allow duplicate names or seqs.

About the three potential duplicate sequences: the warning message just shows three proteins that each have other proteins identical to them in the protein.fasta file. It doesn't mean that these three proteins are identical to one another. I will have to modify the message to make that clear.
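
To see which proteins share a sequence with each listed ID, one option is to group the FASTA records by exact sequence. The following is a minimal sketch, not odp's own code: the function names `read_fasta` and `identical_groups` are hypothetical, and it assumes "identical" means an exact string match between sequences.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (protein_id, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the ID, i.e. the text up to the first whitespace.
            fields = line[1:].split()
            header, chunks = (fields[0] if fields else ""), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def identical_groups(records):
    """Group protein IDs by exact sequence; keep groups with more than one ID."""
    by_seq = defaultdict(list)
    for name, seq in records:
        by_seq[seq].append(name)
    return [ids for ids in by_seq.values() if len(ids) > 1]
```

Passing an open file handle for the protein FASTA to `read_fasta` and the result to `identical_groups` returns one list of IDs per duplicated sequence, so each ID named in the warning should appear in a group together with its identical partners.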

@alexvasilikop
Author

Ah ok, thanks a lot Darrin! Yes, I was a bit confused by the message, but it is clear now.

Thanks again
Alex

@conchoecia
Owner

TODO: clarify the error messages in both odp and nway_rbh

@conchoecia
Owner

See also #49

@conchoecia
Owner

The messaging is clearer now in the recent push ba6c453

Closing for now - see #45 for further discussion.
