Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GoodCigarFilter + tests #380

Merged
merged 1 commit into from
Apr 27, 2015
Merged

GoodCigarFilter + tests #380

merged 1 commit into from
Apr 27, 2015

Conversation

akiezun
Copy link
Contributor

@akiezun akiezun commented Apr 16, 2015

ported BadCigarFilter (in Hellbender filters have positive names, in concordance with Java filter semantics) + tests (added 2 tests to cover 2 more branches, 1 branch seems unreachable).

@vruano please review
addresses #373

@akiezun akiezun added this to the BlobFish milestone Apr 16, 2015
@coveralls
Copy link
Collaborator

Coverage Status

Coverage increased (+0.04%) to 71.28% when pulling 0d41980 on ak_good_read_filter into 06505cf on master.

@akiezun akiezun assigned lbergelson and unassigned vruano Apr 17, 2015
@akiezun
Copy link
Contributor Author

akiezun commented Apr 17, 2015

@vruano is on vacation. @lbergelson please review

@vruano
Copy link
Contributor

vruano commented Apr 18, 2015

Done with my review. I'm not assigning it back to @akiezun in case @lbergelson wants to add.

@akiezun akiezun assigned vruano and unassigned lbergelson Apr 19, 2015
@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-0.84%) to 70.4% when pulling 0ea45ee on ak_good_read_filter into 06505cf on master.

@akiezun
Copy link
Contributor Author

akiezun commented Apr 20, 2015

I think we should write most readable code for now and optimize with a profiler later. The regexp machinery in Java libraries definitely does a better job optimizing the state machine than a person can.

@vruano
Copy link
Contributor

vruano commented Apr 21, 2015

Finished with my review back to @akiezun

@vruano vruano assigned akiezun and unassigned vruano Apr 21, 2015
@vruano
Copy link
Contributor

vruano commented Apr 21, 2015

Please check on my general comments posted at the end of the second commit, I thought that they would be posted here as well.

@akiezun
Copy link
Contributor Author

akiezun commented Apr 21, 2015

@vruano you're raising 3 issues:

  1. is the algorithm correct?
  2. is the algorithm's results the same as in GATK3?
  3. is the algorithm esthetically pleasing/readable.

For 1, I'm going to create a new ticket - it's not under discussion here.
For 2, I'll enter the cases you're suggesting as unit tests and see what happens.
For 3, I can work on making it read better.

My first goal in this tickets is to port the existing filter. If we can improve the readability at the same time, that's great. Fixing any bugs or deficiencies of the algorithm is strictly future work (if it has been broken all this time in GATK3, it's ok for it to be broken for a few more weeks in GATK4).

@akiezun
Copy link
Contributor Author

akiezun commented Apr 21, 2015

@eitanbanks and @vdauwera I heard from @droazen that you guys are working on the BadCigarFilter in GATK3. Can you summarize the work? we need to synchronize the changes/improvements across the 2 codebases.

@eitanbanks
Copy link
Contributor

BadCigarFilter was made a default Walker filter and functionality was added to it so that it now also filters out reads whose cigar length does not equal the read length.

@akiezun
Copy link
Contributor Author

akiezun commented Apr 21, 2015

is the code in?

On Tue, Apr 21, 2015 at 4:51 PM, Eric Banks [email protected]
wrote:

BadCigarFilter was made a default Walker filter and functionality was
added to it so that it now also filters out reads whose cigar length does
not equal the read length.


Reply to this email directly or view it on GitHub
#380 (comment)
.

@vdauwera
Copy link
Contributor

See https://github.com/broadinstitute/gsa-unstable/pull/930

The title is misleading -- see the last few comments in the conversation for a summary of the work done.

@akiezun
Copy link
Contributor Author

akiezun commented Apr 21, 2015

why not rename the issue then?

@akiezun
Copy link
Contributor Author

akiezun commented Apr 22, 2015

@eitanbanks thanks. the code and tests are now in hellbender too.

@akiezun
Copy link
Contributor Author

akiezun commented Apr 22, 2015

@vruano can you give an example for "Also it does accept non empty CIGAR with out any operator that consume read bases such as N or P. That would not happen with the GATK3 code." ?

@akiezun
Copy link
Contributor Author

akiezun commented Apr 22, 2015

@vruano back to you

I fixed the 1H1H1M problem you brought up "GATK3's filter would not complain for consecutive H operations at the beginning whereas the new code does" and I entered tests for it (one for head, one for tail).

I entered #428 for clarifying the spec.

@akiezun akiezun assigned vruano and unassigned akiezun Apr 22, 2015
@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-0.84%) to 70.4% when pulling 1f1efaa on ak_good_read_filter into 06505cf on master.

@vruano
Copy link
Contributor

vruano commented Apr 22, 2015

What I meant is that for example "10P" or "101N" would fail in the GATK3 code but the would not in the current pull-request. The reason is that the P and N operators return false for the method "consumesReadBases()" and the GATK3 code pays attention to that; it requires that at least there is an operator that consumes read bases in the non-clipping core section of the cigar (I,M,EQ or X).

@akiezun
Copy link
Contributor Author

akiezun commented Apr 23, 2015

alright @vruano @droazen
as discussed today I will recode this as a bunch of loops - the GATK3 code is atrocious but my regexp did not me win the fan following i was going for. And ultimately it's about the readability at this point (which is a subjective manner, you guys said loops would make it more readable and people's opinion defines 'readability')

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-0.84%) to 70.4% when pulling c1abddf on ak_good_read_filter into 06505cf on master.

@vruano
Copy link
Contributor

vruano commented Apr 23, 2015

"@vruano to clarify. both 10P and 101N pass on GATK3. I just checked."

I see... it seems that this might be cause by yet another bug in GATK3 code ... but I guess that is to be left for the other issue.

From this code within the second loop:

if (!hasMeaningfulElements && op.consumesReadBases()) {
                    hasMeaningfulElements = true;
}

one would think that to have a meaningful-element is to have one that consumes read bases.

However it turns out that this if condition is never true as:

boolean hasMeaningfulElements = (firstOp != CigarOperator.H && firstOp != CigarOperator.S);

is guaranteed to initialize it to true.

So that is some code you can get rid off for now. But I guess that it need to be fixed eventually.

@akiezun
Copy link
Contributor Author

akiezun commented Apr 24, 2015

Does anyone know why there are 3 implementation of validity for Cigars in GATK3?

  1. in the BadCigarFilter
  2. in CigarUtils
  3. in Cigar itself ?

@akiezun
Copy link
Contributor Author

akiezun commented Apr 24, 2015

the third one is in htsjdk obviously. anyway, i'm merging the first two

@akiezun
Copy link
Contributor Author

akiezun commented Apr 24, 2015

I clarified the semantics of Good CIGAR. It does not make sense to have 'good cigars' that are invalid according to htsjdk so I clarified the semantics to state that a "good" cigar is a valid cigar that passes additional criteria:

  - does not have consecutive I/D elements
  - does not start or end with deletions (with or without preceding clips).

(now, we can argue about those criteria but those are the criteria in GATK3 and the current code and doc makes this explicit.)

I rewrote the whole code again to reflect this change. @vruano back to you.

@akiezun akiezun assigned vruano and unassigned akiezun Apr 24, 2015
@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-0.78%) to 70.46% when pulling 012d037 on ak_good_read_filter into 06505cf on master.

@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-0.78%) to 70.46% when pulling 012d037 on ak_good_read_filter into 06505cf on master.

@vruano
Copy link
Contributor

vruano commented Apr 24, 2015

The code is far easier to understand now. Mostly minor changes and a suggestion to improve the readability of one of the supporting private methods.

CigarUtils.countRefBasesBasedOnCigar have some issues but these are not really part of your code so it might be fair to address them separately in another pull-request.

Back to @akiezun.

@akiezun
Copy link
Contributor Author

akiezun commented Apr 25, 2015

This filter did not deserve a special file after I moved the isGood method to CigarUtils. So I moved this filter to ReadFilterLibrary and moved tests accordingly.

back to @vruano

@akiezun akiezun assigned vruano and unassigned akiezun Apr 25, 2015
@akiezun akiezun mentioned this pull request Apr 26, 2015
@vruano
Copy link
Contributor

vruano commented Apr 26, 2015

@akiezun please rebase, squash and merge.

@vruano vruano assigned akiezun and unassigned vruano Apr 26, 2015
@akiezun akiezun force-pushed the ak_good_read_filter branch 3 times, most recently from 3589239 to 175fd37 Compare April 27, 2015 01:03
@coveralls
Copy link
Collaborator

Coverage Status

Coverage decreased (-0.06%) to 70.65% when pulling 175fd37 on ak_good_read_filter into f1eefec on master.

akiezun added a commit that referenced this pull request Apr 27, 2015
@akiezun akiezun merged commit b07f457 into master Apr 27, 2015
@akiezun akiezun deleted the ak_good_read_filter branch April 27, 2015 01:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants