Improvements and refactoring of Nucleotide.java #4846

Merged
merged 4 commits into from
Aug 27, 2018

Conversation

Contributor

@vruano vruano commented Jun 1, 2018

Rename the methods valueOf and toBase to decode and encodeAsByte. The reason is that valueOf (taking a String, for example) has an established meaning for Java enums: it should throw an IllegalArgumentException if no such constant exists, and we don't do that here (we return Nucleotide#X instead).

Apart from that, I added a few more methods to cover functionality that I need as part of a larger pull request for SV.
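
To illustrate the naming concern, here is a minimal sketch using a hypothetical, simplified `Base` enum (an assumption for illustration, not the actual `Nucleotide` class). It contrasts the strict, compiler-generated `valueOf(String)` with a lenient `decode` that falls back to `X`:

```java
// Hypothetical, simplified stand-in for Nucleotide; not the GATK class.
enum Base {
    A, C, G, T, X; // X plays the role of the "invalid" catch-all.

    /** Lenient decoding: unknown encodings map to X instead of throwing. */
    static Base decode(final byte encoding) {
        switch (encoding) {
            case 'A': case 'a': return A;
            case 'C': case 'c': return C;
            case 'G': case 'g': return G;
            case 'T': case 't': return T;
            default: return X;
        }
    }
}

class ValueOfVersusDecode {
    public static void main(final String[] args) {
        // The compiler-generated valueOf(String) is strict about unknown names.
        try {
            Base.valueOf("Q");
        } catch (final IllegalArgumentException ex) {
            System.out.println("valueOf rejected Q");
        }
        // The lenient decoder never throws; it returns the catch-all constant.
        System.out.println(Base.decode((byte) 'Q'));
    }
}
```

A caller who sees `valueOf` would reasonably expect the strict behavior, which is why a distinct name such as `decode` avoids surprises.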

@vruano
Contributor Author

vruano commented Jun 1, 2018

@samuelklee please take a look.

@vruano
Contributor Author

vruano commented Jun 1, 2018

There was a Travis issue that was solvable by rebasing (the automatic merge didn't compile). It should work now.

@codecov-io

codecov-io commented Jun 1, 2018

Codecov Report

Merging #4846 into master will increase coverage by 0.023%.
The diff coverage is 94.318%.

@@               Coverage Diff               @@
##              master     #4846       +/-   ##
===============================================
+ Coverage     86.661%   86.684%   +0.023%     
- Complexity     29043     29179      +136     
===============================================
  Files           1808      1808               
  Lines         134662    135021      +359     
  Branches       14935     14986       +51     
===============================================
+ Hits          116700    117042      +342     
- Misses         12550     12563       +13     
- Partials        5412      5416        +4

| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| ...llbender/utils/reference/FastaReferenceWriter.java | 92.414% <0%> (ø) | 51 <0> (ø) | ⬇️ |
| ...ers/readorientation/LearnReadOrientationModel.java | 96% <100%> (ø) | 30 <0> (ø) | ⬇️ |
| ...org/broadinstitute/hellbender/utils/RandomDNA.java | 95.556% <100%> (ø) | 23 <0> (ø) | ⬇️ |
| ...er/formats/collections/AllelicCountCollection.java | 100% <100%> (ø) | 5 <0> (ø) | ⬇️ |
| ...lbender/tools/copynumber/CollectAllelicCounts.java | 92% <100%> (ø) | 9 <0> (ø) | ⬇️ |
| ...dinstitute/hellbender/utils/RandomDNAUnitTest.java | 93.151% <100%> (ø) | 37 <0> (ø) | ⬇️ |
| ...llbender/tools/copynumber/PreprocessIntervals.java | 100% <100%> (ø) | 19 <2> (ø) | ⬇️ |
| ...adorientation/LearnReadOrientationModelEngine.java | 94.872% <100%> (ø) | 37 <0> (ø) | ⬇️ |
| ...ols/walkers/readorientation/CollectF1R2Counts.java | 75.641% <100%> (ø) | 29 <6> (ø) | ⬇️ |
| ...s/walkers/readorientation/F1R2FilterConstants.java | 83.333% <100%> (ø) | 4 <0> (ø) | ⬇️ |

... and 5 more

Contributor

@samuelklee samuelklee left a comment

Just some typos and minor comments. I did not check closely for correctness, but your tests look OK to me.


// Extended codes:
// CODE(included nucs)
R(A,G), // Purine
Contributor

Nitpicking, but perhaps make the descriptions here a little more consistent in style? Currently, some are plural, some end in periods, etc.

}

/**
* Checks whether the nucleotide refer to an ambiguous base.
Contributor

refer -> refers

/**
* Checks whether this nucleotide code encloses all possible nucleotides for another code.
* @param other the other nucleotide to compare to.
* @return {@code true} iff any nucleotide in {@code other} is enclosed it this code.
Contributor

it -> in
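
For readers unfamiliar with the "encloses" relation discussed in this Javadoc, the following is a rough sketch of how ambiguity codes can be modeled as bit masks. The `Code` enum, its mask layout, and the handling of `X` are all assumptions made for illustration; the PR only shows that `Nucleotide` has a `mask` field, not how `includes` is implemented.

```java
// Hypothetical sketch; names and mask layout are assumptions, not the GATK source.
enum Code {
    A(0b0001), C(0b0010), G(0b0100), T(0b1000),
    R(0b0101), // A or G (purine)
    Y(0b1010), // C or T (pyrimidine)
    N(0b1111), // any base
    X(0b0000); // invalid: no base at all

    private final int mask;

    Code(final int mask) {
        this.mask = mask;
    }

    /** True iff every base allowed by {@code other} is also allowed by this code. */
    boolean includes(final Code other) {
        // Assumption: X behaves like NaN and never includes or is included.
        if (this == X || other == X) {
            return false;
        }
        return (mask & other.mask) == other.mask;
    }
}
```

Under these assumptions, `Code.R.includes(Code.A)` holds because the bit for A is set in R's mask, while `Code.A.includes(Code.R)` does not.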

}

/**
* Checks whether to base encodings make reference to the same {@link #Nucleotide}
Contributor

to -> two

Contributor

Still need to fix this.

Contributor Author

Oops, I missed that. OK.

* each possible nucleotide in this code.
* </p>
* <p>
* The complement of the {@link #INVALID} nucleotide its itself.
Contributor

its -> is

}

/**
* Returns the instance that would include all possible transition mutation from this one.
Contributor

mutation -> mutations

Contributor

(or perhaps: transition mutation -> transitions)

}

/**
* Returns the instance that would include all possible tranversions mutation from this one.
Contributor

similar here, transversions mutation -> transversion mutations (or transversions)

public long sum() {
return LongStream.of(counts).sum();
}
}

Contributor

white space

Assert.assertEquals(Nucleotide.X.toBase(), (byte)'X');
@Test(dataProvider = "values")
public void testEncodeAsByte(final Nucleotide nuc) {
// Will always use the first letter of the constant as the one byt
Contributor

one byt -> ?

}
Assert.assertSame(Nucleotide.valueOf(i), expected, "Failed with base " + i + " returning nucleotide " + Nucleotide.valueOf(i));
}

Contributor

white space

@samuelklee samuelklee assigned vruano and unassigned samuelklee Jun 20, 2018
@magicDGS
Contributor

Hello @vruano - the code that you have here is quite nice, and I would like to include some bits in the next version of htsjdk (htsjdk-next-beta). Do you have any problem with that?

@vruano
Contributor Author

vruano commented Aug 18, 2018

@samuelklee
Sorry for the delay. I have addressed your comments, plus some conflict resolutions from a rebase.
Please let me know if you want me to change anything else.

@vruano vruano assigned samuelklee and unassigned vruano Aug 18, 2018
@vruano
Contributor Author

vruano commented Aug 18, 2018

@magicDGS sure, anything for the common good.

@vruano vruano force-pushed the vrr_nucs branch 2 times, most recently from b6636e8 to 6cf0639 Compare August 21, 2018 01:15
final int lowerCaseIndex = nucleotide.lowerCaseByteEncoding & 0xFF;
final int upperCaseIndex = nucleotide.upperCaseByteEncoding & 0xFF;
maskToValue[nucleotide.mask]
= baseToValue[lowerCaseIndex]
Contributor

@samuelklee samuelklee Aug 21, 2018

Not sure if multiple assignment here and below adheres to the Java style guide.

Contributor Author

A quick Google search does not show anything against multiple assignment; perhaps you can post a link to it.

I will split the assignment to maskToValue from the other two to baseToValue onto separate lines, as they go into different arrays; it is perhaps clearer that way.
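
As a small illustration of the two styles being discussed, the sketch below uses plain `int` arrays and made-up indices (assumptions; the real tables hold `Nucleotide` values). Both variants leave the arrays in the same state, since assignment in Java is an expression that associates right to left.

```java
class ChainedAssignment {
    /** One chained assignment that writes into both tables at once. */
    static void fillChained(final int[] maskToValue, final int[] baseToValue) {
        // Evaluated right to left: all three slots receive the value 7.
        maskToValue[1] = baseToValue['a'] = baseToValue['A'] = 7;
    }

    /** Same effect, with the two target arrays on separate lines. */
    static void fillSplit(final int[] maskToValue, final int[] baseToValue) {
        maskToValue[1] = 7;
        baseToValue['a'] = baseToValue['A'] = 7;
    }
}
```

The split form keeps one statement per target array, which is the readability gain suggested above.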

}

private final int mask;
private final boolean isStandard;

// Some properties initialized after construction as these depends on some static arrays.
Contributor

depends -> depend

Contributor Author

Ok.

@samuelklee
Contributor

A couple more minor comments. I will take your word that your changes improve performance!

@vruano vruano force-pushed the vrr_nucs branch 3 times, most recently from c5d5284 to 45eab69 Compare August 21, 2018 20:59
@vruano
Contributor Author

vruano commented Aug 21, 2018

@samuelklee
I addressed your comments, improved the docs, and added support for char and String encodings. I won't add anything else after your final review, promised :P

Contributor

@samuelklee samuelklee left a comment

Just a few more minor suggestions, which you can take or leave. Good to merge after!


/**
* Represents the nucleotide alphabet with support for IUPAC ambiguity codes.
*
* <p>
* This enumeration not only contains standard (non-ambiguous) nucleotides, but also
* values to represent ambiguous and invalid codes.
* values to represent ambiguous and an invalid nucleotide call ({@link #X} aka {@link #INVALID}.
Contributor

Perhaps "...but also contains ambiguous nucleotides, as well as a code {@link #X} (a.k.a. {@link #INVALID}) for invalid nucleotide calls."

* </p>
*
* <p>
* You can query whether a value refers to a non-ambiguous nucleotide with {@link #isStandard()} or
* {@link #isAmbiguous()} depending of your preference. Notice that the special value {@link #X}
Contributor

X is grouped together with the ambiguous nucleotides below in line 95. Perhaps add a line break to avoid confusion.

* <p>
* Querying the {@link #X} value for its {@link #complement}, {@link #transition} or
* {@link #transversion} or using it in other operations
* likes {@link #intersect} would result in returning also a {@link #X}; similar to {@link Double#NaN} in
Contributor

likes intersect would result in returning also a X -> like (or perhaps "such as") intersect will return an X

* Finally, notice that there is no code of the "gap nucleotide" that may appear in aligned sequences as in fact that is not a nucleotide.
* A base encoding using the typical gap representation such as '.' or '-' would
* be interpreted as an {@link #INVALID} (i.e. {@link #X}) call which is probably not what you want.
* So code that mus support those will need to do so outside this {@code enum}.
Contributor

So code that mus support -> So code to support?

Contributor

BTW, thanks for adding all of this documentation!
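
The Javadoc under review notes that gap characters such as '.' or '-' would decode as invalid, so callers must handle them outside the enum. One possible way to do that is sketched below; `GapAwareDecoding`, `isGap`, and `decodeBase` are hypothetical helpers (with `decodeBase` standing in for `Nucleotide.decode` and returning a plain `char`), not part of this PR.

```java
import java.util.Optional;

class GapAwareDecoding {
    /** Assumption: only '-' and '.' are treated as gap characters. */
    static boolean isGap(final char ch) {
        return ch == '-' || ch == '.';
    }

    /** Stand-in for Nucleotide.decode: returns 'X' for anything but A, C, G or T. */
    static char decodeBase(final char ch) {
        final char upper = Character.toUpperCase(ch);
        return "ACGT".indexOf(upper) >= 0 ? upper : 'X';
    }

    /** Empty for gaps, otherwise the decoded base (possibly 'X'). */
    static Optional<Character> decodeAligned(final char ch) {
        if (isGap(ch)) {
            return Optional.empty();
        }
        return Optional.of(decodeBase(ch));
    }
}
```

Checking for gaps first keeps "no base at this column" distinct from "a base call that could not be interpreted".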

* convenient longer form names constant aliases (e.g. {@link #ADENINE} for {@link #A}, {@link #PURINE} for {@link #R}, etc.).
* </p>
* <p>
* Finally, notice that there is no code of the "gap nucleotide" that may appear in aligned sequences as in fact that is not a nucleotide.
Contributor

code of -> code for


/**
* Returns this nucleotide's exclusive upper-case {@code byte} encoding.
* @return <i>ditto</i>.
Contributor

I think you can probably just leave out the @return lines here and below.

public static Nucleotide decode(final char base) {
return baseToValue[base & 0xFF];
public static Nucleotide decode(final char ch) {
if ((ch & 0xFF00) != 0) {
Contributor

Perhaps extract the operations used in this method.
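
To show why the high-byte check in this snippet matters, and what extracting it into a named helper could look like, here is a rough sketch. The class, the `char`-valued lookup table, and the helper names are assumptions for illustration; the real method returns a `Nucleotide` and the rest of its body is not shown in the diff.

```java
import java.util.Arrays;

class CharDecoding {
    // Hypothetical lookup table: 'X' (invalid) everywhere except A, C, G, T in either case.
    private static final char[] TABLE = new char[256];
    static {
        Arrays.fill(TABLE, 'X');
        for (final char c : "ACGT".toCharArray()) {
            TABLE[c] = c;
            TABLE[Character.toLowerCase(c)] = c;
        }
    }

    /** Extracted check: true iff the upper byte of the char is zero. */
    static boolean fitsInByte(final char ch) {
        return (ch & 0xFF00) == 0;
    }

    /** Old behavior: masking silently drops the upper byte. */
    static char decodeMasked(final char ch) {
        return TABLE[ch & 0xFF];
    }

    /** New behavior: chars outside the single-byte range are invalid. */
    static char decodeChecked(final char ch) {
        return fitsInByte(ch) ? TABLE[ch] : 'X';
    }
}
```

For example, `'\u0141'` has low byte `0x41`, the code of 'A', so `decodeMasked('\u0141')` yields 'A' while `decodeChecked('\u0141')` yields 'X'.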

* Transform a single-letter character string into the corresponding nucleotide.
* <p>
* {@code Null}, empty or multi-letter input will result in an {@link IllegalArgumentException}.
* These are not simply invalid encodings as the fact that are not a single character is
Contributor

encodings as the fact that are not a single character -> encodings, and the fact that they are not a single character?


/**
* Checks the assumption that each nuc canonical name is a single upper-case letter.
* If this test fails that would indicates that you are modifying {@link Nucleotide} in a way
Contributor

that would indicates -> that would indicate

Assert.assertEquals(subject.get(n), (long) shadow.getOrDefault(n, 0));
}
Assert.assertEquals(subject.sum(), shadow.values().stream().mapToLong(l -> l).sum());
}
Contributor

Thanks for adding these tests!

vruano added 4 commits August 27, 2018 11:09
…hars and CharSequences).

Also, the Counter now has methods to add `char` and String typed encodings.
I have added tests for these.
I fixed a bug where the upper byte of `char` encodings was ignored (high-ordinal chars should be considered INVALID).
The counter sum implementation is now more efficient, avoiding the use of streams.
Improved documentation.
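
For reference, the stream-based `sum()` quoted earlier in the review and a loop-based alternative might compare as in the sketch below. The `CounterSum` class and the `long[]` parameter are assumptions for illustration; the merged implementation is not shown in this conversation.

```java
import java.util.stream.LongStream;

class CounterSum {
    /** Stream-based version, as in the snippet reviewed earlier. */
    static long sumWithStream(final long[] counts) {
        return LongStream.of(counts).sum();
    }

    /** Plain loop: same result without creating a stream pipeline. */
    static long sumWithLoop(final long[] counts) {
        long total = 0;
        for (final long count : counts) {
            total += count;
        }
        return total;
    }
}
```

Both methods return the same total; the loop simply avoids the stream setup on each call.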
@vruano vruano merged commit 9271542 into master Aug 27, 2018
@vruano vruano deleted the vrr_nucs branch August 27, 2018 19:34