Handling of ambiguous amino acids #346

Closed
donovan-h-parks opened this issue Mar 7, 2016 · 8 comments

Comments

@donovan-h-parks

pplacer currently does not handle ambiguous bases. I appreciate that, from an ML perspective, fully handling such characters is challenging. However, I am wondering if ambiguous bases can simply be treated as unknowns and a warning generated. This would seem preferable to raising an exception that prevents such sequences from being inserted into a tree:

Uncaught exception: Failure("J is not a known base in GCA_000389905.1_ASM38990v1_protein")
Fatal error: exception Failure("J is not a known base in GCA_000389905.1_ASM38990v1_protein")
Uncaught exception: Sys_error("./bacteria/chunk0/storage/tree/concatenated.pplacer.json: No such file or directory")
Fatal error: exception Sys_error("./bacteria/chunk0/storage/tree/concatenated.pplacer.json: No such file or directory")

Such situations are extremely problematic when processing large data sets where quality control over the input sequences can be challenging.
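
As a rough illustration of the behaviour requested here, the following Python sketch maps an unrecognized residue to a fully unknown state and emits a warning rather than raising. It is not pplacer's code (pplacer is written in OCaml); the function name `residue_likelihoods`, the residue ordering, and the choice of `X`, `-` and `?` as silent unknowns are all assumptions made for the example.

```python
import warnings

# The 20 standard amino acids; the order is an arbitrary choice for this sketch.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Characters assumed (for this sketch only) to mean "unknown" without a warning.
SILENT_UNKNOWNS = frozenset("X-?")


def residue_likelihoods(residue, seq_name):
    """Return a 20-element indicator vector for a single residue character.

    A recognized residue gets a 1.0 in its own column. Anything else is
    treated as fully unknown (all ones); unexpected characters also
    trigger a warning instead of an exception.
    """
    residue = residue.upper()
    if len(residue) == 1 and residue in AMINO_ACIDS:
        return [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]
    if residue not in SILENT_UNKNOWNS:
        warnings.warn(
            f"{residue} is not a known residue in {seq_name}; treating it as unknown"
        )
    return [1.0] * len(AMINO_ACIDS)
```

With this approach a sequence containing `J` would still be placed, at the cost of that site carrying no information.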

@matsen
Owner

matsen commented Mar 7, 2016

Hello Donovan!

Is J an ambiguous base? I haven't heard of it: http://www.bioinformatics.org/sms/iupac.html

@donovan-h-parks
Author

Guess it depends on who you ask:
http://www.insdc.org/files/feature_table.html#7.4.3

Also, it appears in a non-trivial number of genomes from GenBank. :)

@matsen
Owner

matsen commented May 16, 2016

Hey @dparks1134 --

Thanks for your patience with this. I have implemented this, and as a diagnostic I have the following table:

  'A' 'R' 'N' 'D' 'C' 'Q' 'E' 'G' 'H' 'I' 'L' 'K' 'M' 'F' 'P' 'S' 'T' 'W' 'Y' 'V'
B {0.; 0.; 1.; 1.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.}
J {0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 1.; 1.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.}
Z {0.; 0.; 0.; 0.; 0.; 1.; 1.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.; 0.}

This is how I interpret the ambiguity codes. Does that look right to you?
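
The interpretation in the table above can be sketched in a few lines of Python. This is only an illustration of the mapping, not the actual pplacer implementation; the names `AMBIGUITY_CODES` and `ambiguity_vector` are hypothetical, and the column order is copied from the table header.

```python
# Column order taken from the header row of the diagnostic table.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Each ambiguity code and the residues it may stand for.
AMBIGUITY_CODES = {
    "B": "ND",  # asparagine or aspartic acid
    "J": "IL",  # isoleucine or leucine
    "Z": "QE",  # glutamine or glutamic acid
}


def ambiguity_vector(code):
    """Return a 20-element vector with 1.0 for every residue the code allows."""
    allowed = AMBIGUITY_CODES[code]
    return [1.0 if aa in allowed else 0.0 for aa in AMINO_ACIDS]


# Print one row per ambiguity code, mirroring the layout of the table.
for code in sorted(AMBIGUITY_CODES):
    print(code, ambiguity_vector(code))
```

Each row has exactly two non-zero entries, so an ambiguous residue is compatible with either of its two candidates and incompatible with everything else.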

I should also note that, as you can see in be77c7d, for some reason I was previously interpreting B as a synonym for N and Z as a synonym for Q. So making this change will make a minor difference in some folks' analyses in which those letters appear.

Could you pull the new branch and give it a spin?

@donovan-h-parks
Author

donovan-h-parks commented May 17, 2016

Thanks! Table looks good to me. I'm a bit swamped at the moment, but should be able to give this a spin in the next few weeks.

@matsen
Owner

matsen commented May 17, 2016

No worries! Just let me know if it looks good and I'll merge.

@donovan-h-parks
Author

Can you send me the binaries for this new release? We don't have a build environment for pplacer.

@donovan-h-parks
Author

We compiled the latest code and it appears to work great. I'd say make it official!

matsen added a commit that referenced this issue May 21, 2016
matsen added a commit that referenced this issue May 21, 2016
@matsen
Owner

matsen commented May 22, 2016
