Support for circular sequences #67

Open
apcamargo opened this issue Dec 13, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@apcamargo

In theory, circular sequences should not contain partial genes: a gene spanning the breakpoint is a single gene whose two parts sit at the beginning and the end of the linearized sequence. However, Prodigal/pyrodigal currently (1) do not assign the same gene ID to the partial sequences at both ends of the sequence, and (2) more critically, treat the sequence edges as independent. As a result, a partial gene can sometimes be identified at only one end.

To address this issue, I've been using a script I wrote that iteratively moves the breakpoint to minimize gene truncation. However, this approach is obviously suboptimal and can occasionally fail (i.e., find no breakpoint that eliminates truncations), because the predicted genes may change each time the breakpoint moves.
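A minimal sketch of the breakpoint-moving idea, not the actual script: `rotate`, `find_clean_breakpoint`, and the `Predictor` interface are hypothetical names, and `toy_predict` is a stand-in that treats uppercase runs as genes. With pyrodigal, `predict` could presumably wrap `GeneFinder.find_genes`, but that wiring is an assumption and is left out here.

```python
import re
from typing import Callable, List, Optional, Tuple

# (begin, end, is_partial), 1-based inclusive coordinates on the linear input.
Gene = Tuple[int, int, bool]
# Hypothetical predictor interface; not part of pyrodigal.
Predictor = Callable[[str], List[Gene]]


def rotate(seq: str, offset: int) -> str:
    """Re-linearize a circular sequence so that it starts at index `offset`."""
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


def find_clean_breakpoint(
    seq: str, predict: Predictor, max_iter: int = 10
) -> Optional[int]:
    """Return an offset whose rotation yields no partial genes, or None."""
    offset = 0
    for _ in range(max_iter):
        genes = predict(rotate(seq, offset))
        if not any(partial for _, _, partial in genes):
            return offset
        complete = sorted((b, e) for b, e, partial in genes if not partial)
        # Intergenic gaps between consecutive complete genes: (width, end of left gene).
        gaps = [(b2 - e1 - 1, e1) for (_, e1), (b2, _) in zip(complete, complete[1:])]
        gaps = [g for g in gaps if g[0] > 0]
        if not gaps:
            return None  # nowhere obvious to cut
        width, prev_end = max(gaps)
        # Cut in the middle of the widest gap and retry; predictions may change.
        offset = (offset + prev_end + width // 2) % len(seq)
    return None


def toy_predict(seq: str) -> List[Gene]:
    """Stand-in predictor: uppercase runs are genes; runs touching an edge are partial."""
    return [
        (m.start() + 1, m.end(), m.start() == 0 or m.end() == len(seq))
        for m in re.finditer(r"[A-Z]+", seq)
    ]


genome = "GGaaaaCCCCCCttttttAAAAAAccGG"  # one gene is split across both ends
best = find_clean_breakpoint(genome, toy_predict)
print(best, rotate(genome, best))
```

The loop can return `None` even when a clean breakpoint exists, which mirrors the failure mode described above: each rotation is a fresh prediction, so nothing guarantees convergence.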

Although addressing this limitation would require significant effort, it would make pyrodigal stand out among gene prediction tools.

@althonos added the enhancement (New feature or request) label on Dec 29, 2024
@althonos
Owner

After looking into the issue a bit, here's what I think is possible: keep the core algorithm and just allow some nodes to "wrap" around both ends of the sequence. The rest of the scoring procedures (RBS detection, GC%, etc.) can be updated further, but I think most of them wouldn't change.

The biggest problem here is that we'd need to do some refactoring first, as there are some areas of the code where I can't really guarantee what happens when it receives negative indices or indices greater than the sequence length. In clean code this should never happen, but I found in some issues here that Prodigal already does some out-of-buffer indexing at times... I'll start a refactor first so it's easier to isolate the code that needs to be changed.
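To illustrate what "wrapping" indices could mean, here is a small Python sketch of modular indexing on a circular sequence. It is only an illustration under assumed conventions (0-based, end-exclusive coordinates); `circular_slice` and `wraps_origin` are hypothetical helpers, not functions from Prodigal or pyrodigal.

```python
def circular_slice(seq: str, begin: int, end: int) -> str:
    """Return the bases in [begin, end) of a circular sequence.

    Indices may be negative or exceed len(seq); they are reduced modulo
    the sequence length instead of reading out of bounds.
    """
    n = len(seq)
    if not 0 <= end - begin <= n:
        raise ValueError("interval must cover between 0 and len(seq) bases")
    return "".join(seq[i % n] for i in range(begin, end))


def wraps_origin(begin: int, end: int, n: int) -> bool:
    """Tell whether [begin, end) crosses the end of a circular sequence of length n."""
    return (begin % n) + (end - begin) > n


seq = "ACGTAC"
print(circular_slice(seq, 4, 8))   # indices 4, 5, 0, 1
print(circular_slice(seq, -2, 2))  # same interval, written with a negative begin
print(wraps_origin(4, 8, len(seq)))
```

Making the modulo explicit in one helper is one way to keep every out-of-range access deliberate, which is the kind of guarantee the refactor mentioned above would need.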
