Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-1749: Fix supportVectoredIO for hadoop version string with optional patch labels #1990

Closed
wants to merge 1 commit into from

Conversation

otorreno
Copy link
Contributor

@otorreno otorreno commented Jul 25, 2024

What changes were proposed in this pull request?

Parse correctly semantic versioning when there is the optional labels in the patch are present.

Why are the changes needed?

There are cases where the hadoop version info may not be respecting the semantic versioning. It is the case for the hadoop version provided in some of the AWS managed services. This causes a ExceptionInInitializerError while trying to instantiate and ORC file reader.

How was this patch tested?

Included unit tests. It required a small refactor to be able to test it in an easier way

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the JAVA label Jul 25, 2024
@otorreno otorreno force-pushed the fix/supportVectoredIO branch from 99f9b15 to 0a06fd1 Compare July 25, 2024 21:31
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making a PR, @otorreno .

Do you think you can provide a test case to give us more context, @otorreno ?

// There could be cases where we are not able to parse the version
// Here we are defaulting to the non vectoredIO if that is the case

@otorreno
Copy link
Contributor Author

otorreno commented Jul 25, 2024

Thank you for making a PR, @otorreno .

Do you think you can provide a test case to give us more context, @otorreno ?

// There could be cases where we are not able to parse the version
// Here we are defaulting to the non vectoredIO if that is the case

The version I am getting from AWS for hadoop is 3.3.6-amzn-3, so we fail parsing the patch

@otorreno
Copy link
Contributor Author

@dongjoon-hyun I can certainly write a test case, but I would have to pass down the VersionInfo and do a little refactor. Let me know if you would like me to do that

@otorreno otorreno force-pushed the fix/supportVectoredIO branch from 0a06fd1 to a2afdd0 Compare July 25, 2024 22:08
@otorreno
Copy link
Contributor Author

@dongjoon-hyun just pushed the tests

@otorreno otorreno force-pushed the fix/supportVectoredIO branch from a2afdd0 to a8180da Compare July 25, 2024 22:13
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3.3.6-amzn-3 looks like reasonable requests which we had better handle because it follows semantic versioning.

Can we make this proposal more compatible with Semantic Versioning? For example, 3.3.6-amzn-3 has HADOOP-18103 if they follows semantic versioning. So, I believe supportVectoredIO should return true.

Could you make a change in a way to consider Semantic Versioning?

int minor = Integer.parseInt(versionParts[1]);
int patch = Integer.parseInt(versionParts[2]);
return major == 3 && (minor > 3 || (minor == 3 && patch > 4));
default boolean supportVectoredIO(String version) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is okay for me.

@otorreno otorreno force-pushed the fix/supportVectoredIO branch from a8180da to 7a7c68d Compare July 25, 2024 22:23
@otorreno
Copy link
Contributor Author

3.3.6-amzn-3 looks like reasonable requests which we had better handle because it follows semantic versioning.

Can we make this proposal more compatible with Semantic Versioning? For example, 3.3.6-amzn-3 has HADOOP-18103 if they follows semantic versioning. So, I believe supportVectoredIO should return true.

Could you make a change in a way to consider Semantic Versioning?

Done @dongjoon-hyun

@dongjoon-hyun
Copy link
Member

As a last piece, please update the PR title and description according to your latest code change.

Here, there is no bad hadoop version string.

@dongjoon-hyun dongjoon-hyun added this to the 2.0.2 milestone Jul 25, 2024
@otorreno otorreno changed the title Fix supportVectoredIO for bad hadoop version string Fix supportVectoredIO for hadoop version string with optional patch labels Jul 25, 2024
@dongjoon-hyun dongjoon-hyun changed the title Fix supportVectoredIO for hadoop version string with optional patch labels Fix supportVectoredIO for hadoop version string with optional patch labels Jul 25, 2024
@otorreno
Copy link
Contributor Author

As a last piece, please update the PR title and description according to your latest code change.

Here, there is no bad hadoop version string.

you caught me doing that :). Done

@dongjoon-hyun dongjoon-hyun changed the title Fix supportVectoredIO for hadoop version string with optional patch labels ORC-1749: Fix supportVectoredIO for hadoop version string with optional patch labels Jul 25, 2024
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (Pending CIs).

BTW, I created ORC-1749 to track this. Do you happen to have Apache JIRA ID, @otorreno ?

If then, I can add you to the Apache ORC contributor group and assigned ORC-1749 to you as a recognition of this contribution.

@otorreno
Copy link
Contributor Author

+1, LGTM (Pending CIs).

BTW, I created ORC-1749 to track this. Do you happen to have Apache JIRA ID, @otorreno ?

I do not. I guess I should create one

@dongjoon-hyun
Copy link
Member

Yes, please apply one. I can approve the ID creation quickly for you.

@@ -48,7 +49,7 @@
*/
public class RecordReaderUtils {
private static final HadoopShims SHIMS = HadoopShimsFactory.get();
private static final boolean supportVectoredIO = SHIMS.supportVectoredIO();
private static final boolean supportVectoredIO = SHIMS.supportVectoredIO(VersionInfo.getVersion());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make CI happy?

[INFO] There is 1 error reported by Checkstyle 10.17.0 with checkstyle.xml ruleset.
Error:  src/java/org/apache/orc/impl/RecordReaderUtils.java:[52] (sizes) LineLength: Line is longer than 100 characters (found 101).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see now.

@otorreno
Copy link
Contributor Author

Yes, please apply one. I can approve the ID creation quickly for you.

I have requested one with oscartorreno as username.

@dongjoon-hyun
Copy link
Member

Ya, I approved you request.

And, I changed the reporter field of ORC-1749 to oscartorreno. :)

When this PR is merged, the assignee field will be filled with oscartorreno too.

There are cases where the hadoop version info may not be respecting
the semantic versioning. It is the case for the hadoop version
provided in some of the AWS managed services. This causes
a ExceptionInInitializerError while trying to instantiate and ORC
file reader.

Fixing this by defaulting to non vectored IO in case the semantic
versioning is not respected
@otorreno otorreno force-pushed the fix/supportVectoredIO branch from 7a7c68d to 36bfffe Compare July 25, 2024 22:42
dongjoon-hyun pushed a commit that referenced this pull request Jul 25, 2024
…onal patch labels

### What changes were proposed in this pull request?
Parse correctly semantic versioning when there is the optional labels in the patch are present.

### Why are the changes needed?
There are cases where the hadoop version info may not be respecting the semantic versioning. It is the case for the hadoop version provided in some of the AWS managed services. This causes a ExceptionInInitializerError while trying to instantiate and ORC file reader.

### How was this patch tested?
Included unit tests. It required a small refactor to be able to test it in an easier way

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #1990 from otorreno/fix/supportVectoredIO.

Authored-by: Oscar Torreno <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 37201cb)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Copy link
Member

Welcome to the Apache ORC community, @otorreno .

You are added to the Apache ORC contributor group and ORC-1749 is assigned to you.

This is a part of Apache ORC 2.0.2 release milestone and will be released in 3 weeks before August 15th. In addition, this will be a part of Apache Spark 4.0.0-preview2 in a near future.

Thank you again and congratulations for your first commit to ASF repository.

dongjoon-hyun added a commit to apache/spark that referenced this pull request Aug 16, 2024
### What changes were proposed in this pull request?

This PR aims to upgrade ORC to 2.0.2 for Apache Spark 4.0.0.

### Why are the changes needed?

To bring the latest maintenance release with bug fixes.
- https://orc.apache.org/news/2024/08/15/ORC-2.0.2/
  - apache/orc#1989
  - apache/orc#1990

### Does this PR introduce _any_ user-facing change?

No. This is not released.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47774 from dongjoon-hyun/SPARK-49251.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
### What changes were proposed in this pull request?

This PR aims to upgrade ORC to 2.0.2 for Apache Spark 4.0.0.

### Why are the changes needed?

To bring the latest maintenance release with bug fixes.
- https://orc.apache.org/news/2024/08/15/ORC-2.0.2/
  - apache/orc#1989
  - apache/orc#1990

### Does this PR introduce _any_ user-facing change?

No. This is not released.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47774 from dongjoon-hyun/SPARK-49251.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
### What changes were proposed in this pull request?

This PR aims to upgrade ORC to 2.0.2 for Apache Spark 4.0.0.

### Why are the changes needed?

To bring the latest maintenance release with bug fixes.
- https://orc.apache.org/news/2024/08/15/ORC-2.0.2/
  - apache/orc#1989
  - apache/orc#1990

### Does this PR introduce _any_ user-facing change?

No. This is not released.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47774 from dongjoon-hyun/SPARK-49251.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
### What changes were proposed in this pull request?

This PR aims to upgrade ORC to 2.0.2 for Apache Spark 4.0.0.

### Why are the changes needed?

To bring the latest maintenance release with bug fixes.
- https://orc.apache.org/news/2024/08/15/ORC-2.0.2/
  - apache/orc#1989
  - apache/orc#1990

### Does this PR introduce _any_ user-facing change?

No. This is not released.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47774 from dongjoon-hyun/SPARK-49251.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants