Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CSV-196] Track byte position #502

Merged
merged 9 commits into from
Jan 2, 2025
Merged

Conversation

DarrenJAN
Copy link
Contributor

Add support in Commons CSV for tracking byte positions during parsing.

Summary of Modifications

  1. Test Data Files: Added new test data files, and updated pom.xml to exclude these files from RAT checks, avoiding unapproved license checks.
  2. CSVParser class:
    Constructor Enhancements
    a. Added support for an optional parameter -- String encoding--, which specifies the encoding to use for the reader.
  3. CSVRecord class
  • private long characterByte: start byte position of this record
    Add new Constructor: support track byte positions in record class
  1. ExtendedBufferedReader Class:
  • private long bytesRead: Tracks the number of bytes read so far.
  • private long bytesReadMark: Stores the marked byte position.
  • CharsetEncoder encoder: Encoder used to calculate byte size of characters.
  • getCharBytes(int current): This function calculates character bytes based on UTF-8 encoding. Note: it only supports UTF-8 due to the encoding algorithm used. Full encoding can be supported and we just need more effort on this.
  • reset() and mark() Methods: Enhanced to prevent consuming characters and bytes unintentionally.

Test result:
mvn
image
image
Pass unit tests and other restrictions

…#9)

Add support in Commons CSV for tracking byte positions during parsing
Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @DarrenJAN

Thank you for providing a PR!

Please see my comments. Note that some comments will apply to more to one location in the code.

Make sure you rebase on git master.

* @throws IOException If an I/O error occurs
* @throws CSVException Thrown on invalid input.
*/
public CSVParser parse(final Reader reader, final long characterOffset, final long recordNumber, String encoding) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No new public constructors please. You can augment the builder and add a private constructor.

All new and protected elements should have a Java @since 1.13.0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that there is only builder class in CSVFormat, there are no builder class in CSVParser, would you like me to add one?
There are CSVParserBuilder() from com.opencsv.CSVParser, that is another package.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh? See

public static class Builder extends AbstractStreamBuilder<CSVParser, Builder> {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like I did not rebase on the master branch. Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section is not resolved.

* If there is a problem reading the header or skipping the first record
* @throws CSVException Thrown on invalid input.
*/
public CSVParser(final Reader reader, final CSVFormat format, final long characterOffset, final long recordNumber,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment.

@@ -144,6 +159,15 @@ public long getCharacterPosition() {
return characterPosition;
}

/**
* Returns the start byte of this record as a character byte in the source stream.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please follow the same Javadoc patterns as in other getter methods: A getter "Gets...".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping

private long bytesReadMark;

/** Encoder used to calculate the bytes of characters */
CharsetEncoder encoder;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this new instance variable private.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not resolved.

/**
* Constructs a new instance using the default buffer size.
*/
ExtendedBufferedReader(final Reader reader) {
super(reader);
}

ExtendedBufferedReader(final Reader reader, String encoding) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a Charset instead of a Charset name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not resolved.

src/test/java/org/apache/commons/csv/CSVParserTest.java Outdated Show resolved Hide resolved
/**
* Returns the start byte of this record as a character byte in the source stream.
*
* @return the start byte of this record as a character byte in the source stream.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either the code or the comment is wrong. The return value is a long but the comment talks about a byte. It would help to make the comment clear as to why the return type is a long if that is indeed correct.

Copy link
Member

@garydgregory garydgregory Dec 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New public and protected elements should have a Javadoc tag of @since 1.13.0.
Please clarify the comment: Specifically document this data as "position".

@@ -48,6 +48,11 @@ public final class CSVRecord implements Serializable, Iterable<String> {
*/
private final long characterPosition;

/**
* The start byte of this record as a character byte in the source stream.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is confusing because it talks about a byte but the type is a long. It would be better to explain the mismatch, if that is indeed the intent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are calculating the number of bytes required to encode a single character and storing this value as a long. This choice provides a larger range for the position and ensures consistency with the design, as position is also defined as a long.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the comment to clarify.

* @throws CSVException Thrown on invalid input.
*/
public CSVParser(final Reader reader, final CSVFormat format, final long characterOffset, final long recordNumber,
String encoding) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a Charset, not a charset name.

@DarrenJAN
Copy link
Contributor Author

Hi Gary,
I submitted another commit. Here are the changes that I made:

  1. Add augment the builder and add a private constructor to create CVSParser class
  2. Delete unneeded constructors in CSVformat class
  3. Use a try-with-resources block
  4. Improve the comments and fix the indentation
  5. Use a Charset instead of a string Charset name in ExtendedBufferedReader class

Thanks
Yuzhan

Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @DarrenJAN

Thank you for your updates. There is one adjustment to make where code is not needed because it is provided in the superclass.

src/main/java/org/apache/commons/csv/CSVParser.java Outdated Show resolved Hide resolved
src/main/java/org/apache/commons/csv/CSVParser.java Outdated Show resolved Hide resolved
src/main/java/org/apache/commons/csv/CSVParser.java Outdated Show resolved Hide resolved
@DarrenJAN
Copy link
Contributor Author

Hi Gary,

Just submit another change.

Please note that due to the encoding algorithm used, it only supports UTF-8 for now. Full encoding can be supported; I just need to put more effort into this.

Thanks
Yuzhan

Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @DarrenJAN

Now that I look at this again, we are always using the Charset, which is not enough to only have the feature enabled as an opt-in. This is due to the Charset being used to read the file in the first place, it's likely that the Charset will be often set, so I think we need a boolean flag that says "recordByteCount". The flag should drive whether the feature is enabled or not, not the Charset. You still need the Charset, but that's not what enables the feature.

Does that make sense? I think adding a boolean recordByteCount to the builder is what's needed here. WDYT?

Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @DarrenJAN

There are many comments that were incorrectly marked resolved. See also my comment about adding a boolean to drive the new opt-in behavior.

@DarrenJAN
Copy link
Contributor Author

Hi Gary,

Yes, that makes sense. I agree that relying solely on the Charset to enable the feature isn’t sufficient, as the Charset is often set independently of whether the feature should be enabled.

In our original design, instead of using a boolean, we used a String variable because it served a dual purpose: first, it controlled whether the feature was enabled or not, and second, it specified the Charset and the corresponding byte-counting algorithm to be used. Do you consider using a String?

@garydgregory
Copy link
Member

Hello @DarrenJAN

I don't understand how a String can work. Let's say I specify I want to read a CSV file encoded in Charset X. The Charset (encoder) that is used by the new counting feature MUST match X, otherwise the counting risks being mismatched. Or am I missing something? Hence, the need for a boolean or some other type that's not a Charset, maybe even an Enum if there a need for something more than on and off.

If the PR only supports a subset of Charsets, then the Javadoc of the setter for the feature toggle must document this.

It's also likely that I am not seeing how the PR code only supports some Charsets and not others. What's missing?

TY.

@DarrenJAN
Copy link
Contributor Author

Hi Gary,

In ExtendedBufferedReader.java, this function getCharBytes(int current), the logic of this code is based on the UTF-8 encoding specification. We need extra implementation to support the full set of characters.

"Hence, the need for a boolean or some other type that's not a Charset, maybe even an Enum if there a need for something more than on and off."
--- We used a String here. We only did a simple check whether String is null to enable this encoder feature since our customers only asked for UTF-8 encoder. I agree with adding a flag to control this feature.

@DarrenJAN
Copy link
Contributor Author

Hi Gary, any thoughts on this?

@garydgregory
Copy link
Member

Hello @DarrenJAN
Yes, follow up on my comments to add a boolean feature toggle 😉

Adding a boolean to drive byte tracking opt-in behavior
@DarrenJAN
Copy link
Contributor Author

Hi @garydgregory,
I submitted another commit to add a boolean flag

Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @DarrenJAN

Thank you for your updates. Please see my comments.

@@ -153,6 +153,7 @@ public static class Builder extends AbstractStreamBuilder<CSVParser, Builder> {
private CSVFormat format;
private long characterOffset;
private long recordNumber = 1;
private boolean enableByteTracking = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't initialize to the default value.

@@ -144,6 +159,15 @@ public long getCharacterPosition() {
return characterPosition;
}

/**
* Returns the start byte of this record as a character byte in the source stream.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping

lastChar = current;
position++;
return lastChar;
}

/**
* In Java, the {@code char} data type is based on the original Unicode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getter get, IOW, the Javadoc should start with "Gets ...".

@@ -103,6 +103,15 @@ long getCharacterPosition() {
return reader.getPosition();
}

/**
* Returns the number of bytes read
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Returns" -> "Gets".


@Test
public void testGetRecordFourBytesRead() throws Exception {
String code = "id,a,b,c\n" +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use final.

* Fix comments
Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @DarrenJAN

Thank you for updates.

I think we have the basic logic done but I have some additional change requests.

TY!

* @throws IOException
* If there is a problem reading the header or skipping the first record.
* @throws CSVException Thrown on invalid input.
* @since 1.13.0.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New private elements do not need a Javadoc since tag.

* @return the byte length of the character.
* @throws CharacterCodingException if the character cannot be encoded.
*/
private long getCharBytes(int current) throws CharacterCodingException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type here should be an int because this API returns 0 or CharBuffer.limit() which itself returns an int.

* @return the byte length of the character.
* @throws CharacterCodingException if the character cannot be encoded.
*/
private long getCharBytes(int current) throws CharacterCodingException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better method name here is getEncodedCharLength().

* @param recordNumber
* The next record number to assign.
* @param charset
* The character encoding to be used for the reader.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The character encoding to be used for the reader."
->
"The character encoding to be used for the reader when enableByteTracking is true."

/**
* The start byte of this record as a character byte in the source stream.
*/
private final long characterByte;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is too confusing IMO, please rename to bytePosition to mirror the existing characterPosition.

@DarrenJAN
Copy link
Contributor Author

Hi @garydgregory,

Happy Holidays! The changes that I made:

  1. New private elements do not need a Javadoc since tag.
  2. Return type of getCharByte is int
  3. Change getCharBytes to getEncodedCharLength
  4. Add Javadoc @param for enableByteTracking
  5. CSVRecord: Remove the old constructors and adapt the existing test.
  6. Change characterByte to bytePosition @yunzvanessa How do you think this change?

Test:
mvn
image

Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @DarrenJAN

Thank you for updating the PR.

You missed a couple of my comments regarding documentation.

Please rebase on git master to pickup updates that will automatically show you where you missed using final and other Checkstyle updates. You'll have to resolve conflicts in one test file.

TY & happy hols!

@@ -48,6 +48,11 @@ public final class CSVRecord implements Serializable, Iterable<String> {
*/
private final long characterPosition;

/**
* The start byte of this record as a character byte in the source stream.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the comment to clarify.

/**
* Returns the start byte of this record as a character byte in the source stream.
*
* @return the start byte of this record as a character byte in the source stream.
Copy link
Member

@garydgregory garydgregory Dec 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New public and protected elements should have a Javadoc tag of @since 1.13.0.
Please clarify the comment: Specifically document this data as "position".

Copy link
Member

@garydgregory garydgregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @DarrenJAN
I cannot edit files in this PR, which is unusual; otherwise, I would have added the missing Javadoc @since tags. Please do so and see my unresolved comments.
TY!

@DarrenJAN
Copy link
Contributor Author

Hi @garydgregory, I checked and made the enhancements:

  1. git rebase master
  2. For new public and protected methods, add javadoc since 1.13.0
  3. Clarify the comment:

Test:
mvn
image

@garydgregory garydgregory merged commit b40039b into apache:master Jan 2, 2025
8 of 9 checks passed
@garydgregory garydgregory changed the title CSV-196-TrackBytePositions [CSV-196] Track byte position Jan 2, 2025
asfgit pushed a commit that referenced this pull request Jan 2, 2025
@garydgregory
Copy link
Member

@DarrenJAN
Copy link
Contributor Author

Hi @garydgregory ,

Thank you very much. When is the next release?

Yuzhan

@garydgregory
Copy link
Member

garydgregory commented Jan 3, 2025

Hello @DarrenJAN
I have three open release candidates ATM, so when that's done, I'll start the release process for this component. In the meantime, please test a build from our Maven snapshot repository:

https://repository.apache.org/content/repositories/snapshots/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants