PARQUET-2415: Reuse hadoop file status and footer in ParquetRecordReader #1242
Conversation
Force-pushed 6ee8649 to 10d74ed.
Force-pushed 10d74ed to 0321a2d.
I cannot reproduce the failed UT. Could anyone help check this issue?

Have you tried running mvn install before running the cli test? It may run the test with a dependency from Maven Central without your patch.
Force-pushed 0a9c353 to 8bca57b.
Hi @wgtmac, I have fixed this issue in the cli module. Thanks!
@@ -84,6 +86,9 @@ public static ParquetMetadata fromJSON(String json) {
  private final FileMetaData fileMetaData;
  private final List<BlockMetaData> blocks;

  @JsonIgnore
Why is this annotation required?
With this annotation, the Jackson mapper will not serialize this field to JSON, keeping the same behavior as before.
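For reference, a minimal sketch of the effect, using an illustrative class rather than the actual ParquetMetadata:

```java
import com.fasterxml.jackson.annotation.JsonIgnore;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonIgnoreSketch {
  static class Metadata {
    public String name = "m";
    @JsonIgnore  // skipped during serialization, so the JSON output is unchanged
    public Object inputFile = new Object();
  }

  public static void main(String[] args) throws Exception {
    // Prints {"name":"m"}; without @JsonIgnore, Jackson would try (and fail)
    // to serialize the plain Object held by inputFile.
    System.out.println(new ObjectMapper().writeValueAsString(new Metadata()));
  }
}
```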
HadoopInputFile inputFile;
if (split.getFooter() != null
    && split.getFooter().getInputFile() != null
    && split.getFooter().getInputFile() instanceof HadoopInputFile) {
cc @amousavigourabi to see if there is any chance to apply this to the non-Hadoop code path.
Hi @wgtmac, given that the `inputFile` variable seems to only be used in a constructor expecting an `InputFile` and not necessarily a `HadoopInputFile`, I think this `instanceof` condition could be dropped. As I just quickly skimmed it now and might have missed something, I'll take a more thorough look after Boxing Day. Happy holidays! 🎄🎆
Thanks @wgtmac @amousavigourabi for your review. I have changed the `HadoopInputFile` to `InputFile`.
Looking good! Going back to @wgtmac's earlier concern, the method this snippet is part of is already in the Hadoop code path, and I'm not sure whether there is a more generic alternative available. For the rest, the switch to using the plain `InputFile` interface here is of course amazing for flexibility in the future and makes the code a bit cleaner. Thanks a lot for the swift fix @wankunde!
Hi @wgtmac @amousavigourabi, is there any concern about this PR?
+1 on my side.
I think this should be a very common requirement, and I'm not sure whether the community has discussed it before. cc @gszadovszky @shangxinli @Fokko @ConeyLiu
Hi @wankunde, sorry for the delayed response. I don't see any blockers on my side and love the patch, so it's a +1 from me.
@@ -95,6 +95,11 @@
    <artifactId>jackson-databind</artifactId>
    <version>${jackson-databind.version}</version>
  </dependency>
  <dependency>
    <groupId>${jackson.groupId}</groupId>
Can we avoid adding a dependency?
The jackson-annotations dependency is used in parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ParquetMetadata.java so that the `InputFile inputFile` field is not serialized to JSON, keeping the same behavior as before. I'm sorry, I'm not familiar with the Jackson library and am not sure whether there is any other way to do this.
I happened to find that we have a parquet-jackson module which shades jackson-core and jackson-databind. But parquet-hadoop (and other modules) also explicitly depends on parquet-jackson and the jackson-xxx artifacts at the same time. I'm not familiar with this history; do you know why? @gszadovszky @Fokko @shangxinli
Thanks! Sorry for missing that.
  return footer;
}

public void setFooter(ParquetMetadata footer) {
The `ParquetInputSplit` is marked as deprecated, and the recommended usage is `FileSplit`. How does Spark set the `footer` after `ParquetInputSplit` is removed?
Now Parquet converts the input split to a `ParquetInputSplit` and builds the reader with it. I think if `ParquetInputSplit` were removed from the ParquetFileReader class, Spark would need a shim class to work with different Parquet versions.
That would be a big change.
I think this PR can be a good reason to push the Spark community to migrate. Or should we fix this only in Spark 4.x?
Already filed a WIP ticket apache/spark#44853 for Spark 4; I will discuss this change on the Spark side in that PR after this PR is merged.
IIUC, other comments have suggested that we should not build on a deprecated interface, so I don't expect this PR to be merged as is. It would be good to figure out the final solution on the Spark side before any action here.
@@ -64,13 +65,19 @@ public int run() throws IOException {
    return 0;
  }

  abstract class MixIn {
    @JsonIgnore
    abstract int getInputFile();
Is there any better way to do this? What about annotating the `getInputFile` at https://github.com/apache/parquet-mr/pull/1242/files#diff-69ea2eba95ed65129e4a4a9e5807807a3c138ea39f2eddf5d6f6b8f2a3b51c73R115?
I'm sorry, the UT failed and I don't know why.
You mean this is a workaround to get rid of the test failure at the cost of a new dependency?
I mean the UT will fail if we just annotate the `getInputFile` method, so I created a MixIn class here (in the parquet-cli module) as a workaround.
The Parquet project already has a dependency on the jackson-annotations library in some other modules, so I don't think this PR adds a new dependency to the parquet-hadoop module.
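For context, a hedged sketch of the mix-in technique being discussed; the class and member names are illustrative, not the exact ones from this diff:

```java
import com.fasterxml.jackson.annotation.JsonIgnore;
import com.fasterxml.jackson.databind.ObjectMapper;

public class MixInSketch {
  // A target class we cannot (or do not want to) annotate directly.
  static class Footer {
    public String getName() { return "footer"; }
    public Object getInputFile() { return new Object(); }  // not JSON-friendly
  }

  // The mix-in: Jackson applies its annotations to the matching members
  // of Footer without modifying Footer itself.
  abstract static class FooterMixIn {
    @JsonIgnore
    abstract Object getInputFile();
  }

  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    mapper.addMixIn(Footer.class, FooterMixIn.class);
    // Prints {"name":"footer"}; getInputFile is skipped thanks to the mix-in.
    System.out.println(mapper.writeValueAsString(new Footer()));
  }
}
```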
+1 for the concept. We need to address that.
if (split.getFooter() != null && split.getFooter().getInputFile() != null) {
  inputFile = split.getFooter().getInputFile();
} else {
  inputFile = HadoopInputFile.fromPath(path, configuration);
If the FileStatus (or at least the file length) can get down here, then it becomes possible to skip a HEAD request when opening a file against cloud storage. The API you need is in Hadoop 3.3.0 and not very reflection friendly; we could add something to assist there.
What is key is: get as much info as possible into HadoopInputFile, especially the expected length.
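A hedged sketch of that Hadoop 3.3.0 builder API, assuming the caller already holds a FileStatus (e.g. from an earlier directory listing); the helper name is hypothetical:

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class OpenFileSketch {
  // Hypothetical helper: open a file while reusing an already-known
  // FileStatus so object-store connectors (e.g. S3A) can skip the HEAD
  // request they would otherwise issue to probe existence and length.
  static FSDataInputStream openWithStatus(Configuration conf, Path path, FileStatus status)
      throws IOException, InterruptedException, ExecutionException {
    return path.getFileSystem(conf)
        .openFile(path)          // builder API added in Hadoop 3.3.0
        .withFileStatus(status)  // pass cached metadata down to the connector
        .build()                 // CompletableFuture<FSDataInputStream>
        .get();
  }
}
```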
Make sure you have checked all steps below.

Jira
- My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-2415: Reuse hadoop file status and footer in ParquetRecordReader"

Tests

Commits
- My commits follow the guidelines from "How to write a good git commit message".

Style
- Code style has been applied via mvn spotless:apply -Pvector-plugins

Documentation
Call `ParquetInputSplit.setFooter(footer)` before creating a ParquetRecordReader from this split; the reader will then reuse the Hadoop file status, skip the getfileinfo Hadoop RPC, and skip reading the footer again.
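As a hedged usage sketch, assuming the accessors this PR adds (per the diff snippets above); the surrounding context is the caller's own:

```java
import org.apache.parquet.hadoop.ParquetInputSplit;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public final class FooterReuseSketch {
  // Attach a footer that the engine already read during planning to the
  // split before the record reader is created. setFooter is the accessor
  // added by this PR.
  static void attachFooter(ParquetInputSplit split, ParquetMetadata footer) {
    split.setFooter(footer);
    // A ParquetRecordReader initialized from this split can now reuse the
    // footer (and its InputFile / file status) instead of re-reading it,
    // avoiding the getfileinfo RPC and a second footer read.
  }
}
```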