-
Notifications
You must be signed in to change notification settings - Fork 269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CSV-196] Track byte position #502
Conversation
…#9) Add support in Commons CSV for tracking byte positions during parsing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @DarrenJAN
Thank you for providing a PR!
Please see my comments. Note that some comments will apply to more to one location in the code.
Make sure you rebase on git master.
* @throws IOException If an I/O error occurs | ||
* @throws CSVException Thrown on invalid input. | ||
*/ | ||
public CSVParser parse(final Reader reader, final long characterOffset, final long recordNumber, String encoding) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No new public constructors please. You can augment the builder and add a private constructor.
All new and protected elements should have a Java @since 1.13.0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed that there is only builder class in CSVFormat, there are no builder class in CSVParser, would you like me to add one?
There are CSVParserBuilder() from com.opencsv.CSVParser, that is another package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh? See
public static class Builder extends AbstractStreamBuilder<CSVParser, Builder> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like I did not rebase on the master branch. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section is not resolved.
* If there is a problem reading the header or skipping the first record | ||
* @throws CSVException Thrown on invalid input. | ||
*/ | ||
public CSVParser(final Reader reader, final CSVFormat format, final long characterOffset, final long recordNumber, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See previous comment.
@@ -144,6 +159,15 @@ public long getCharacterPosition() { | |||
return characterPosition; | |||
} | |||
|
|||
/** | |||
* Returns the start byte of this record as a character byte in the source stream. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please follow the same Javadoc patterns as in other getter methods: A getter "Gets...".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping
private long bytesReadMark; | ||
|
||
/** Encoder used to calculate the bytes of characters */ | ||
CharsetEncoder encoder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this new instance variable private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not resolved.
/** | ||
* Constructs a new instance using the default buffer size. | ||
*/ | ||
ExtendedBufferedReader(final Reader reader) { | ||
super(reader); | ||
} | ||
|
||
ExtendedBufferedReader(final Reader reader, String encoding) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a Charset
instead of a Charset name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not resolved.
/** | ||
* Returns the start byte of this record as a character byte in the source stream. | ||
* | ||
* @return the start byte of this record as a character byte in the source stream. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either the code or the comment is wrong. The return value is a long
but the comment talks about a byte
. It would help to make the comment clear as to why the return type is a long
if that is indeed correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New public and protected elements should have a Javadoc tag of @since 1.13.0
.
Please clarify the comment: Specifically document this data as "position".
@@ -48,6 +48,11 @@ public final class CSVRecord implements Serializable, Iterable<String> { | |||
*/ | |||
private final long characterPosition; | |||
|
|||
/** | |||
* The start byte of this record as a character byte in the source stream. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is confusing because it talks about a byte
but the type is a long
. It would be better to explain the mismatch, if that is indeed the intent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are calculating the number of bytes required to encode a single character and storing this value as a long
. This choice provides a larger range for the position and ensures consistency with the design, as position is also defined as a long
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the comment to clarify.
* @throws CSVException Thrown on invalid input. | ||
*/ | ||
public CSVParser(final Reader reader, final CSVFormat format, final long characterOffset, final long recordNumber, | ||
String encoding) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a Charset
, not a charset name.
src/main/java/org/apache/commons/csv/ExtendedBufferedReader.java
Outdated
Show resolved
Hide resolved
…#12) Add support in Commons CSV for tracking byte positions during parsing
Hi Gary,
Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @DarrenJAN
Thank you for your updates. There is one adjustment to make where code is not needed because it is provided in the superclass.
Hi Gary, Just submit another change. Please note that due to the encoding algorithm used, it only supports UTF-8 for now. Full encoding can be supported; I just need to put more effort into this. Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @DarrenJAN
Now that I look at this again, we are always using the Charset, which is not enough to only have the feature enabled as an opt-in. This is due to the Charset being used to read the file in the first place, it's likely that the Charset will be often set, so I think we need a boolean flag that says "recordByteCount". The flag should drive whether the feature is enabled or not, not the Charset. You still need the Charset, but that's not what enables the feature.
Does that make sense? I think adding a boolean recordByteCount to the builder is what's needed here. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @DarrenJAN
There are many comments that were incorrectly marked resolved. See also my comment about adding a boolean to drive the new opt-in behavior.
Hi Gary, Yes, that makes sense. I agree that relying solely on the Charset to enable the feature isn’t sufficient, as the Charset is often set independently of whether the feature should be enabled. In our original design, instead of using a boolean, we used a String variable because it served a dual purpose: first, it controlled whether the feature was enabled or not, and second, it specified the Charset and the corresponding byte-counting algorithm to be used. Do you consider using a String? |
Hello @DarrenJAN I don't understand how a String can work. Let's say I specify I want to read a CSV file encoded in Charset X. The Charset (encoder) that is used by the new counting feature MUST match X, otherwise the counting risks being mismatched. Or am I missing something? Hence, the need for a boolean or some other type that's not a Charset, maybe even an Enum if there a need for something more than on and off. If the PR only supports a subset of Charsets, then the Javadoc of the setter for the feature toggle must document this. It's also likely that I am not seeing how the PR code only supports some Charsets and not others. What's missing? TY. |
Hi Gary, In ExtendedBufferedReader.java, this function getCharBytes(int current), the logic of this code is based on the UTF-8 encoding specification. We need extra implementation to support the full set of characters. "Hence, the need for a boolean or some other type that's not a Charset, maybe even an Enum if there a need for something more than on and off." |
Hi Gary, any thoughts on this? |
Hello @DarrenJAN |
Adding a boolean to drive byte tracking opt-in behavior
Hi @garydgregory, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @DarrenJAN
Thank you for your updates. Please see my comments.
@@ -153,6 +153,7 @@ public static class Builder extends AbstractStreamBuilder<CSVParser, Builder> { | |||
private CSVFormat format; | |||
private long characterOffset; | |||
private long recordNumber = 1; | |||
private boolean enableByteTracking = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't initialize to the default value.
@@ -144,6 +159,15 @@ public long getCharacterPosition() { | |||
return characterPosition; | |||
} | |||
|
|||
/** | |||
* Returns the start byte of this record as a character byte in the source stream. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping
lastChar = current; | ||
position++; | ||
return lastChar; | ||
} | ||
|
||
/** | ||
* In Java, the {@code char} data type is based on the original Unicode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getter get, IOW, the Javadoc should start with "Gets ...".
@@ -103,6 +103,15 @@ long getCharacterPosition() { | |||
return reader.getPosition(); | |||
} | |||
|
|||
/** | |||
* Returns the number of bytes read |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Returns" -> "Gets".
|
||
@Test | ||
public void testGetRecordFourBytesRead() throws Exception { | ||
String code = "id,a,b,c\n" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use final
.
* Fix comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @DarrenJAN
Thank you for updates.
I think we have the basic logic done but I have some additional change requests.
TY!
* @throws IOException | ||
* If there is a problem reading the header or skipping the first record. | ||
* @throws CSVException Thrown on invalid input. | ||
* @since 1.13.0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New private
elements do not need a Javadoc since
tag.
* @return the byte length of the character. | ||
* @throws CharacterCodingException if the character cannot be encoded. | ||
*/ | ||
private long getCharBytes(int current) throws CharacterCodingException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return type here should be an int
because this API returns 0
or CharBuffer.limit()
which itself returns an int
.
* @return the byte length of the character. | ||
* @throws CharacterCodingException if the character cannot be encoded. | ||
*/ | ||
private long getCharBytes(int current) throws CharacterCodingException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a better method name here is getEncodedCharLength()
.
* @param recordNumber | ||
* The next record number to assign. | ||
* @param charset | ||
* The character encoding to be used for the reader. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"The character encoding to be used for the reader."
->
"The character encoding to be used for the reader when enableByteTracking is true."
/** | ||
* The start byte of this record as a character byte in the source stream. | ||
*/ | ||
private final long characterByte; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name is too confusing IMO, please rename to bytePosition
to mirror the existing characterPosition
.
Hi @garydgregory, Happy Holidays! The changes that I made:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @DarrenJAN
Thank you for updating the PR.
You missed a couple of my comments regarding documentation.
Please rebase on git master to pickup updates that will automatically show you where you missed using final
and other Checkstyle updates. You'll have to resolve conflicts in one test file.
TY & happy hols!
@@ -48,6 +48,11 @@ public final class CSVRecord implements Serializable, Iterable<String> { | |||
*/ | |||
private final long characterPosition; | |||
|
|||
/** | |||
* The start byte of this record as a character byte in the source stream. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the comment to clarify.
/** | ||
* Returns the start byte of this record as a character byte in the source stream. | ||
* | ||
* @return the start byte of this record as a character byte in the source stream. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New public and protected elements should have a Javadoc tag of @since 1.13.0
.
Please clarify the comment: Specifically document this data as "position".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @DarrenJAN
I cannot edit files in this PR, which is unusual; otherwise, I would have added the missing Javadoc @since
tags. Please do so and see my unresolved comments.
TY!
Hi @garydgregory, I checked and made the enhancements:
|
Hi @garydgregory , Thank you very much. When is the next release? Yuzhan |
Hello @DarrenJAN https://repository.apache.org/content/repositories/snapshots/ |
Add support in Commons CSV for tracking byte positions during parsing.
Summary of Modifications
Constructor Enhancements
a. Added support for an optional parameter -- String encoding--, which specifies the encoding to use for the reader.
Add new Constructor: support track byte positions in record class
Test result:
mvn
Pass unit tests and other restrictions