-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvements and refactoring of Nucleotide.java #4846
Conversation
@samuelklee please take a look. |
There were travis issue solvable by rebasing (automatic merge didn't compile). Should work now. |
Codecov Report
@@ Coverage Diff @@
## master #4846 +/- ##
===============================================
+ Coverage 86.661% 86.684% +0.023%
- Complexity 29043 29179 +136
===============================================
Files 1808 1808
Lines 134662 135021 +359
Branches 14935 14986 +51
===============================================
+ Hits 116700 117042 +342
- Misses 12550 12563 +13
- Partials 5412 5416 +4
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some typos and minor comments. I did not check closely for correctness, but your tests look OK to me.
|
||
// Extended codes: | ||
// CODE(included nucs) | ||
R(A,G), // Purine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpicking, but perhaps make the descriptions here a little more consistent in style? Currently, some are plural, some end in periods, etc.
} | ||
|
||
/** | ||
* Checks whether the nucleotide refer to an ambiguous base. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refer -> refers
/** | ||
* Checks whether this nucleotide code encloses all possible nucleotides for another code. | ||
* @param other the other nucleotide to compare to. | ||
* @return {@code true} iff any nucleotide in {@code other} is enclosed it this code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it -> in
} | ||
|
||
/** | ||
* Checks whether to base encodings make reference to the same {@link #Nucleotide} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to -> two
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still need to fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I missed that ok.
* each possible nucleotide in this code. | ||
* </p> | ||
* <p> | ||
* The complement of the {@link #INVALID} nucleotide its itself. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its -> is
} | ||
|
||
/** | ||
* Returns the instance that would include all possible transition mutation from this one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mutation -> mutations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(or perhaps: transition mutation -> transitions)
} | ||
|
||
/** | ||
* Returns the instance that would include all possible tranversions mutation from this one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar here, transversions mutation -> transversion mutations (or transversions)
public long sum() { | ||
return LongStream.of(counts).sum(); | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
white space
Assert.assertEquals(Nucleotide.X.toBase(), (byte)'X'); | ||
@Test(dataProvider = "values") | ||
public void testEncodeAsByte(final Nucleotide nuc) { | ||
// Will always use the first letter of the constant as the one byt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one byt -> ?
} | ||
Assert.assertSame(Nucleotide.valueOf(i), expected, "Failed with base " + i + " returning nucleotide " + Nucleotide.valueOf(i)); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
white space
Hello @vruano - the code that you have here is quite nice, and I would like to include some bits on the next version of htsjdk (htsjdk-next-beta). Do you have any problem with that? |
@samuelklee |
@magicDGS sure, anything for the common good. |
b6636e8
to
6cf0639
Compare
final int lowerCaseIndex = nucleotide.lowerCaseByteEncoding & 0xFF; | ||
final int upperCaseIndex = nucleotide.upperCaseByteEncoding & 0xFF; | ||
maskToValue[nucleotide.mask] | ||
= baseToValue[lowerCaseIndex] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if multiple assignment here and below adheres to the Java style guide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I quick google does not show anything against multiple assigns, perhaps you can post a link to it.
I will break into two lines the assignations to maskToValue from the other two to baseToValue as they go into different arrays, is perhaps more clear that way.
} | ||
|
||
private final int mask; | ||
private final boolean isStandard; | ||
|
||
// Some properties initialized after construction as these depends on some static arrays. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
depends -> depend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
A couple more minor comments. I will take your word that your changes improve performance! |
c5d5284
to
45eab69
Compare
@samuelklee |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more minor suggestions, which you can take or leave. Good to merge after!
|
||
/** | ||
* Represents the nucleotide alphabet with support for IUPAC ambiguity codes. | ||
* | ||
* <p> | ||
* This enumeration not only contains standard (non-ambiguous) nucleotides, but also | ||
* values to represent ambiguous and invalid codes. | ||
* values to represent ambiguous and an invalid nucleotide call ({@link #X} aka {@link #INVALID}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* </p> | ||
* | ||
* <p> | ||
* You can query whether a value refers to a non-ambiguous nucleotide with {@link #isStandard()} or | ||
* {@link #isAmbiguous()} depending of your preference. Notice that the special value {@link #X} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
X
is grouped together with the ambiguous nucleotides below in line 95. Perhaps add a line break to avoid confusion.
* <p> | ||
* Querying the {@link #X} value for its {@link #complement}, {@link #transition} or | ||
* {@link #transversion} or using it in other operations | ||
* likes {@link #intersect} would result in returning also a {@link #X}; similar to {@link Double#NaN} in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likes intersect would result in returning also a X -> like (or perhaps "such as") intersect will return an X
* Finally, notice that there is no code of the "gap nucleotide" that may appear in aligned sequences as in fact that is not a nucleotide. | ||
* A base encoding using the typical gap representation such as '.' or '-' would | ||
* be interpreted as an {@link #INVALID} (i.e. {@link #X}) call which is probably not what you want. | ||
* So code that mus support those will need to do so outside this {@code enum}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So code that mus support - > So code to support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, thanks for adding all of this documentation!
* convenient longer form names constant aliases (e.g. {@link #ADENINE} for {@link #A}, {@link #PURINE} for {@link #R}, etc.). | ||
* </p> | ||
* <p> | ||
* Finally, notice that there is no code of the "gap nucleotide" that may appear in aligned sequences as in fact that is not a nucleotide. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code of -> code for
|
||
/** | ||
* Returns this nucleotide's exclusive upper-case {@code byte} encoding. | ||
* @return <i>ditto</i>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can probably just leave out the @return
lines here and below.
public static Nucleotide decode(final char base) { | ||
return baseToValue[base & 0xFF]; | ||
public static Nucleotide decode(final char ch) { | ||
if ((ch & 0xFF00) != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps extract the operations used in this method.
* Transform a single-letter character string into the corresponding nucleotide. | ||
* <p> | ||
* {@code Null}, empty or multi-letter input will result in an {@link IllegalArgumentException}. | ||
* These are not simply invalid encodings as the fact that are not a single character is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
encodings as the fact that are not a single character -> encodings, and the fact that they are not a single character?
|
||
/** | ||
* Checks the assumption that each nuc canonical name is a single upper-case letter. | ||
* If this test fails that would indicates that you are modifying {@link Nucleotide} in a way |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that would indicates -> that would indicate
Assert.assertEquals(subject.get(n), (long) shadow.getOrDefault(n, 0)); | ||
} | ||
Assert.assertEquals(subject.sum(), shadow.values().stream().mapToLong(l -> l).sum()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding these tests!
…hars and CharSequences). Also the Counter has now methods to add `char` and String typed encodings. I have added test for these. I fixed a bug where the upper byte of `char` encodings was ignored (high-ordinal chars should be considered INVALID) The counter sum implementation is now more efficient avoding the use of streams. Improved documentation
Change method names
valueOf
andtoBase
fordecode
andencodeAsByte
. Reason is thatvalueOf
(taking an string for example) has a bit of a meaning in Java Enums (e.g. it should result in a NotSuchElementException if such a constant does not exist and we don't do that here (returns Nucleotide#X).Apart from that I added a few more methods to cover for functionality that I need as part of a larger pull-request for SV.