PARQUET-2415: Reuse hadoop file status and footer in ParquetRecordReader #1242
base: master

Changes from all commits: c03b466, 66c4cd4, 99b43b8, 0321a2d, 7817f4a, 92befc2, 8bca57b, 4bcf880, d86d9c0
parquet-hadoop/pom.xml

```diff
@@ -95,6 +95,11 @@
     <artifactId>jackson-databind</artifactId>
     <version>${jackson-databind.version}</version>
   </dependency>
+  <dependency>
+    <groupId>${jackson.groupId}</groupId>
+    <artifactId>jackson-annotations</artifactId>
+    <version>${jackson-databind.version}</version>
+  </dependency>
   <dependency>
     <groupId>org.xerial.snappy</groupId>
     <artifactId>snappy-java</artifactId>
```

Review thread (on the new dependency):
- Can we avoid adding a dependency?
- The jackson-annotations dependency is used in parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ParquetMetadata.java so that the new field is not serialized.
- I happened to find that we have a parquet-jackson module which shades jackson-core and jackson-databind, yet parquet-hadoop (and other modules) also explicitly depends on parquet-jackson and jackson-xxx at the same time. I'm not familiar with this history, do you know why? @gszadovszky @Fokko @shangxinli
- Thanks! Sorry for missing that.
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputSplit.java

```diff
@@ -18,6 +18,7 @@
  */
 package org.apache.parquet.hadoop;

+import com.fasterxml.jackson.annotation.JsonIgnore;
 import java.io.ByteArrayInputStream;
 import java.io.ByteArrayOutputStream;
 import java.io.DataInput;
@@ -36,6 +37,7 @@
 import org.apache.hadoop.mapreduce.lib.input.FileSplit;
 import org.apache.parquet.hadoop.metadata.BlockMetaData;
 import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
 import org.apache.parquet.schema.MessageType;
 import org.apache.parquet.schema.MessageTypeParser;
@@ -55,6 +57,9 @@ public class ParquetInputSplit extends FileSplit implements Writable {
   private long end;
   private long[] rowGroupOffsets;

+  @JsonIgnore
+  private volatile ParquetMetadata footer;
+
   /**
    * Writables must have a parameterless constructor
    */
@@ -222,6 +227,14 @@ public long[] getRowGroupOffsets() {
     return rowGroupOffsets;
   }

+  public ParquetMetadata getFooter() {
+    return footer;
+  }
+
+  public void setFooter(ParquetMetadata footer) {
+    this.footer = footer;
+  }
+
   @Override
   public String toString() {
     String hosts;
```

Review thread (on setFooter):
- The …
- Now parquet will convert the input split to … That will be a big change.
- I think this PR can be a good reason to push the Spark community to migrate. Or should we fix this only in Spark 4.x?
- Already filed a WIP ticket apache/spark#44853 for Spark 4; will discuss this change on the Spark side in that PR after this PR is merged.
- IIUC, other comments have suggested that we should not work on a deprecated interface, so I don't expect this PR to be merged as is. It would be good to figure out the final solution on the Spark side before taking any action here.
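For context, here is a hedged sketch of how planning-side code could populate the new field so the record reader can skip its own footer read. The helper and its surrounding flow are illustrative assumptions, not code from this PR; only setFooter and setInputFile come from the diffs in this change.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.ParquetReadOptions;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetInputSplit;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class FooterReuseSketch {
  // Hypothetical planning-time helper: read the footer once, then cache both the
  // footer and the InputFile on the split so initializeInternalReader can reuse them.
  static ParquetInputSplit attachFooter(ParquetInputSplit split, Path path, Configuration conf)
      throws IOException {
    HadoopInputFile file = HadoopInputFile.fromPath(path, conf);
    try (ParquetFileReader reader =
        ParquetFileReader.open(file, ParquetReadOptions.builder().build())) {
      ParquetMetadata footer = reader.getFooter();
      footer.setInputFile(file); // new setter from this PR
      split.setFooter(footer);   // new setter from this PR
    }
    return split;
  }
}
```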
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordReader.java

```diff
@@ -46,6 +46,7 @@
 import org.apache.parquet.hadoop.util.ContextUtil;
 import org.apache.parquet.hadoop.util.HadoopInputFile;
 import org.apache.parquet.hadoop.util.counters.BenchmarkCounter;
+import org.apache.parquet.io.InputFile;
 import org.apache.parquet.io.ParquetDecodingException;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -155,8 +156,13 @@ private void initializeInternalReader(ParquetInputSplit split, Configuration con
     }

     // open a reader with the metadata filter
-    ParquetFileReader reader =
-        ParquetFileReader.open(HadoopInputFile.fromPath(path, configuration), optionsBuilder.build());
+    InputFile inputFile;
+    if (split.getFooter() != null && split.getFooter().getInputFile() != null) {
+      inputFile = split.getFooter().getInputFile();
+    } else {
+      inputFile = HadoopInputFile.fromPath(path, configuration);
+    }
+    ParquetFileReader reader = new ParquetFileReader(inputFile, optionsBuilder.build(), split.getFooter());

     if (rowGroupOffsets != null) {
       // verify a row group was found for each offset
```

Review comment (on the fallback branch):
- If the FileStatus (or at least the file length) can get down here, then it becomes possible to skip a HEAD request when opening a file against cloud storage. The API you need is in Hadoop 3.3.0 and is not very reflection-friendly; we could add something to assist there. What is key: get as much info as possible into HadoopInputFile, especially the expected length.
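To illustrate the reviewer's point about skipping a HEAD request: HadoopInputFile already offers fromStatus(FileStatus, Configuration), so a caller that already holds a FileStatus from split planning can construct the input file without a second metadata lookup. The helper below is a sketch under that assumption, not code from this PR.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class FromStatusSketch {
  // Hypothetical helper: a FileStatus is usually already available at
  // split-planning time; passing it along avoids HadoopInputFile issuing its
  // own getFileStatus(), which is a HEAD request against object stores like S3.
  static HadoopInputFile openWithKnownLength(FileStatus status, Configuration conf)
      throws IOException {
    return HadoopInputFile.fromStatus(status, conf);
  }
}
```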
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ParquetMetadata.java

```diff
@@ -18,11 +18,13 @@
  */
 package org.apache.parquet.hadoop.metadata;

+import com.fasterxml.jackson.annotation.JsonIgnore;
 import com.fasterxml.jackson.databind.ObjectMapper;
 import java.io.IOException;
 import java.io.StringReader;
 import java.io.StringWriter;
 import java.util.List;
+import org.apache.parquet.io.InputFile;

 /**
  * Meta Data block stored in the footer of the file
@@ -84,6 +86,9 @@ public static ParquetMetadata fromJSON(String json) {
   private final FileMetaData fileMetaData;
   private final List<BlockMetaData> blocks;

+  @JsonIgnore
+  private volatile InputFile inputFile;
+
   /**
    * @param fileMetaData file level metadata
    * @param blocks block level metadata
```

Review thread (on the @JsonIgnore annotation):
- Why is this annotation required?
- With this annotation the Jackson mapper will not serialize the field to JSON, keeping the same behavior as before.

```diff
@@ -107,6 +112,22 @@ public FileMetaData getFileMetaData() {
     return fileMetaData;
   }

+  /**
+   * Reuse the inputFile in ParquetFileReader if it is not null.
+   * @return the cached input file, or null if none was set
+   */
+  public InputFile getInputFile() {
+    return inputFile;
+  }
+
+  /**
+   *
+   * @param inputFile cache the inputFile in the readFooter method and reuse it in ParquetFileReader
+   */
+  public void setInputFile(InputFile inputFile) {
+    this.inputFile = inputFile;
+  }
+
   @Override
   public String toString() {
     return "ParquetMetaData{" + fileMetaData + ", blocks: " + blocks + "}";
```
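To make the annotation's effect concrete, here is a minimal, self-contained Jackson sketch with hypothetical names (not Parquet code): the annotated field is skipped during serialization, so JSON output is unchanged by the new field.

```java
import com.fasterxml.jackson.annotation.JsonIgnore;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonIgnoreSketch {
  public static class Metadata {
    public String createdBy = "parquet-mr";

    @JsonIgnore // skipped by Jackson; Object has no serializer anyway
    public volatile Object inputFile = new Object();
  }

  public static void main(String[] args) throws Exception {
    // Prints {"createdBy":"parquet-mr"}; without @JsonIgnore this would fail
    // because Jackson cannot serialize a plain Object.
    System.out.println(new ObjectMapper().writeValueAsString(new Metadata()));
  }
}
```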
Review thread (on ParquetMetadata serialization):
- Is there any better way to do this? What about annotating the getInputFile method at https://github.com/apache/parquet-mr/pull/1242/files#diff-69ea2eba95ed65129e4a4a9e5807807a3c138ea39f2eddf5d6f6b8f2a3b51c73R115 instead?
- #1242 (comment): I'm sorry, the UT failed and I don't know why.
- You mean this is a workaround to get rid of the test failure at the cost of a new dependency?
- I mean the UT will fail if we just annotate the getInputFile method; creating a MixIn class here (in the parquet-cli module) works around that. The Parquet project already depends on the jackson-annotations library in some other modules, so I don't think this PR adds a new dependency to the parquet-hadoop module.
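For readers unfamiliar with the mix-in workaround mentioned above, here is a minimal sketch of the general Jackson mix-in technique with hypothetical class names (not the actual parquet-cli code): a mix-in attaches @JsonIgnore to a property of a class you cannot, or do not want to, annotate directly.

```java
import com.fasterxml.jackson.annotation.JsonIgnore;
import com.fasterxml.jackson.databind.ObjectMapper;

public class MixInSketch {
  // Stand-in for a class whose source we do not want to touch.
  public static class Footer {
    public String createdBy = "parquet-mr";
    private final Object inputFile = new Object();
    public Object getInputFile() { return inputFile; }
  }

  // The mix-in declares a member with the same signature and carries the annotation.
  abstract static class FooterMixIn {
    @JsonIgnore
    abstract Object getInputFile();
  }

  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    mapper.addMixIn(Footer.class, FooterMixIn.class);
    // Prints {"createdBy":"parquet-mr"}; inputFile is ignored via the mix-in.
    System.out.println(mapper.writeValueAsString(new Footer()));
  }
}
```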