Provide way to stop normalization if the expression is obviously problematic (such as deletions in large gap/unknown regions) #397
Comments
@ahwagner, we would like you to weigh in on this so we can put a stopgap solution into vrs-python ASAP, even if we have to revisit a more formal decision later.
I agree with the solution proposed by @toneillbroad.
Based on discussion with @larrybabb, it sounds like this is only a problem for genomic sequences, not transcripts. (so by cases:
others?
With great frustration with multithreading in Python, I have found a way to work around this issue in client code at a higher level that doesn't add that much overhead. Using a background task queue, a return value queue, a background process which runs the tasks and can be interrupted, and a timeout on return values, I can terminate any call into a It may still be nice to implement something in vrs-python which checks the sequence beforehand, or in bioutils during roll left/right, because this would make this available to other codebases which use vrs-python. Or I could look at adding something like a translator wrapper which has the timeout logic built in.
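For anyone who wants to try the same workaround, here is a minimal sketch of the timeout idea, not the actual client code. It is simplified to one child process per call rather than a long-lived worker with a task queue. The names `call_with_timeout` and `_worker` are hypothetical, and it assumes a POSIX system where the `fork` start method is available.

```python
import multiprocessing as mp
import queue

# Assumption: POSIX only. "fork" lets the child call any function without pickling it.
_CTX = mp.get_context("fork")


def _worker(func, args, result_queue):
    # Runs in the child process and ships the return value (or the exception) back.
    try:
        result_queue.put(("ok", func(*args)))
    except Exception as exc:
        result_queue.put(("error", exc))


def call_with_timeout(func, args=(), timeout_s=5.0):
    """Run func(*args) in a child process and terminate it after timeout_s seconds."""
    result_queue = _CTX.Queue()
    proc = _CTX.Process(target=_worker, args=(func, args, result_queue))
    proc.start()
    try:
        # Block until the child reports back, or give up after timeout_s.
        status, value = result_queue.get(timeout=timeout_s)
    except queue.Empty:
        proc.terminate()  # unlike a thread, a process can be stopped from outside
        proc.join()
        raise TimeoutError(f"call did not finish within {timeout_s} seconds")
    proc.join()
    if status == "error":
        raise value
    return value
```

The return value has to be picklable because it travels back through the queue. Starting a process per call costs more than reusing one background worker, so this is only meant to show the shape of the approach.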
@theferrit32 have we proposed implementing this over in Biocommons? I agree that it makes sense for us to implement the solution there.
When trying to normalize the variant `NC_000015.9:g.7211_7214del`, the routine goes into a seemingly endless loop trying to figure out the `normalized` result for the `Allele.state`.
Without a full analysis, there is evidence that this is likely caused by the fact that the first 17 million bases in chromosome 15 are all `N`s. So as it rolls right/left to get to a unique sequence region, it will go on for an impractical amount of time.
I suggest we put a limit on how large the sequence can grow when normalizing the Allele. But we should discuss how best to handle this.
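To make the proposed limit concrete, here is a rough sketch of a bounded right-roll for a deletion. This is not the bioutils or vrs-python implementation; the function name `roll_right_bounded` and the `max_roll` parameter are hypothetical and only illustrate the idea of giving up once the roll exceeds a cap.

```python
def roll_right_bounded(sequence, start, end, max_roll):
    """Count how far the deleted interval [start, end) can shift right.

    Deleting [start, end) gives the same result as deleting [start + 1, end + 1)
    whenever sequence[start] == sequence[end], so the interval keeps rolling
    while that holds. A run of identical bases (such as a block of Ns) would
    roll to the end of the run, so stop with an error once max_roll is exceeded.
    """
    steps = 0
    while end + steps < len(sequence) and sequence[start + steps] == sequence[end + steps]:
        if steps >= max_roll:
            raise ValueError(f"deletion rolls more than {max_roll} bases; refusing to normalize")
        steps += 1
    return steps


# The single T deleted at index 3 can shift right across the rest of the T run.
print(roll_right_bounded("ACGTTTTA", 3, 4, max_roll=10))  # prints 3
```

What a sensible value for the cap would be, and whether hitting it should raise or return the allele unnormalized, is exactly the part that still needs discussion.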
@toneillbroad just suggested that maybe we simply disallow any normalization that includes ambiguity-coded bases, i.e. anything other than `A, C, T or G`. I sort of like that as a general rule of thumb, since it is very difficult to address the true `normality` of a sequence that includes any of the ambiguity codes. We can make this a vrs-python rule so that our normalizer doesn't go off and never return in these portions of the reference sequences.