
[SPARK-42388][SQL] Avoid parquet footer reads twice in vectorized reader #39950

Closed · wants to merge 2 commits into master from yabola/skip_footer

Conversation

@yabola (Contributor) commented Feb 9, 2023

What changes were proposed in this pull request?

Parquet footer metadata is currently always read twice in the vectorized parquet reader.
When the NameNode is under high pressure, reading it twice costs time. We can avoid the second read by reading all row groups in advance and filtering them against the pushed-down filters, so the footer metadata never needs to be read again. A sketch of the idea follows.
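A minimal sketch in Scala, assuming the ParquetFooterReader.readFooter(conf, file, skipRowGroup) helper this PR introduces; WITH_ROW_GROUPS appears in the diff below, while SKIP_ROW_GROUPS as a helper constant is an assumption, and the handoff to the reader is simplified:

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.metadata.ParquetMetadata
import org.apache.spark.sql.execution.datasources.PartitionedFile
import org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader

// Read the footer exactly once per task. On the vectorized path, keep the
// row-group metadata so the reader never re-reads the footer; otherwise only
// schema-level metadata is needed.
def readFooterOnce(
    conf: Configuration,
    file: PartitionedFile,
    enableVectorizedReader: Boolean): ParquetMetadata = {
  if (enableVectorizedReader) {
    ParquetFooterReader.readFooter(conf, file, ParquetFooterReader.WITH_ROW_GROUPS)
  } else {
    ParquetFooterReader.readFooter(conf, file, ParquetFooterReader.SKIP_ROW_GROUPS)
  }
}
// The footer is then handed to VectorizedParquetRecordReader.initialize(...),
// which applies any pushed-down filters to the in-memory row groups.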

Why are the changes needed?

Reduce footer reads in the vectorized parquet reader.

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

@github-actions github-actions bot added the SQL label Feb 9, 2023
@yabola yabola changed the title SPARK-42388 Avoid unnecessary parquet footer reads when no filters in vectorized reader [SPARK-42388][SQL] Avoid unnecessary parquet footer reads when no filters in vectorized reader Feb 9, 2023
@@ -207,11 +207,11 @@ class ParquetFileFormat

       lazy val footerFileMetaData =
         ParquetFooterReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS).getFileMetaData
-      val datetimeRebaseSpec = DataSourceUtils.datetimeRebaseSpec(
+      lazy val datetimeRebaseSpec = DataSourceUtils.datetimeRebaseSpec(
         footerFileMetaData.getKeyValueMetaData.get,
@yabola (Contributor, Author) commented Feb 9, 2023

footerFileMetaData is lazy, but datetimeRebaseSpec causes the footer to be read immediately.
We can actually avoid this unnecessary footer read and reuse the footer metadata in VectorizedParquetRecordReader. The toy example below shows the laziness issue.
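A self-contained Scala illustration (toy names, not PR code) of why making the val lazy defers the read:

object LazyFooterDemo extends App {
  // Stand-in for the expensive footer read in ParquetFileFormat.
  lazy val footer: String = { println("footer read!"); "footer-bytes" }

  // With a strict `val`, the right-hand side would force `footer` immediately:
  //   val spec = footer.length   // prints "footer read!" at declaration time
  // With `lazy val`, nothing is read until `spec` is first used:
  lazy val spec = footer.length
  println("no footer read yet")
  println(spec) // "footer read!" happens only here
}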

@yabola yabola marked this pull request as draft February 9, 2023 16:06
@yabola yabola changed the title [SPARK-42388][SQL] Avoid unnecessary parquet footer reads when no filters in vectorized reader [WIP][SPARK-42388][SQL] Avoid unnecessary parquet footer reads when no filters in vectorized reader Feb 10, 2023
@yabola yabola marked this pull request as ready for review February 11, 2023 06:40
@yabola yabola changed the title [WIP][SPARK-42388][SQL] Avoid unnecessary parquet footer reads when no filters in vectorized reader [SPARK-42388][SQL] Avoid unnecessary parquet footer reads when no filters in vectorized reader Feb 11, 2023
@yabola yabola changed the title [SPARK-42388][SQL] Avoid unnecessary parquet footer reads when no filters in vectorized reader [SPARK-42388][SQL] Avoid parquet footer reads when no filters in vectorized reader Feb 11, 2023
@yabola (Contributor, Author) commented Feb 13, 2023

@MaxGekk @gengliangwang If you have time, please take a look. Thanks!

@yabola yabola changed the title [SPARK-42388][SQL] Avoid parquet footer reads when no filters in vectorized reader [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader Feb 16, 2023
@dongjoon-hyun (Member) commented

cc @sunchao, too.

@sunchao (Member) left a comment

Thanks @yabola. I feel this PR is not that useful, though, since in most cases there will be filters pushed down to Parquet. Instead, a better approach IMO is to introduce another constructor (there is one, but it is marked as deprecated) on ParquetFileReader which takes a footer as input, so it doesn't need to read it again. We can then pass the footer obtained in ParquetFileFormat to it via the constructor of VectorizedParquetRecordReader.

We should also support this for data source V2, in ParquetPartitionReaderFactory.

@yabola (Contributor, Author) commented Feb 21, 2023

@sunchao Thank you for your reply!
Yes, I also noticed this; the earlier version just aimed for minimal changes. In the original implementation:

  1. The first footerFileMetaData read uses the SKIP_ROW_GROUPS option (SkipMetadataFilter, returning metadata without row groups);
  2. The second footerFileMetaData read uses RangeMetadataFilter (returning metadata with row-group info).

The second footerFileMetaData actually contains all the information used by the first one (for the implementation difference, see ParquetMetadataConverter#readParquetMetadata); a sketch of the two reads follows below.

So in the case where we need filter pushdown and the vectorized reader is enabled, we can create just one ParquetFileReader and read the parquet footer only once. Other cases can be optimized as well.
This requires modifying more code; do you think it is suitable?
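For reference, a sketch (Scala) of the two reads described above; the parquet-mr calls mirror ones that appear elsewhere in this thread, with the surrounding variables passed in as parameters:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.FileSplit
import org.apache.parquet.HadoopReadOptions
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader

def showBothReads(sharedConf: Configuration, filePath: Path, split: FileSplit): Unit = {
  // First read: SkipMetadataFilter, i.e. schema and key/value metadata, no row groups.
  val metaOnly = ParquetFooterReader.readFooter(
    sharedConf, filePath, ParquetMetadataConverter.SKIP_ROW_GROUPS)

  // Second read: RangeMetadataFilter, i.e. only row groups overlapping this split.
  val rangeFilter = HadoopReadOptions.builder(sharedConf, filePath)
    .withRange(split.getStart, split.getStart + split.getLength)
    .build()
    .getMetadataFilter()
}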

@sunchao (Member) commented Feb 22, 2023

@yabola yes, we'll need to use RangeMetadataFilter (i.e. HadoopReadOptions.builder().withRange()) when we initially read the footer. This is possible since in places like ParquetFileFormat we already have a PartitionedFile, which is just a segment of a Parquet file with a start and a length.

The only problem is that we need a new, non-deprecated API from parquet-mr to support this use case. Personally, I think we can just use the deprecated API for now and replace it after a new Parquet version is released.

@yabola yabola marked this pull request as draft March 2, 2023 02:24
@yabola (Contributor, Author) commented Mar 2, 2023

@sunchao Sorry, it might be necessary to read the footer twice if there are filters. We should read the schema from the footer metadata first to determine which filters can be pushed down. After that, we set the pushdown info (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L261) and read the filtered row groups with the filter configuration.

But we can avoid the second footer read when no filters need to be pushed down. This is useful when scanning joined tables where the filter sits on only one side of the join; it can eliminate many footer reads when many tables are joined and filter conditions exist on only a few of them.

@yabola yabola marked this pull request as ready for review March 2, 2023 14:17
@yabola (Contributor, Author) commented Mar 12, 2023

@sunchao Hi~ Could you take a look at this PR? I think it will be useful when tables are joined and filter conditions exist on only a few of them.

@sunchao (Member) commented Mar 13, 2023

> Sorry, it might be necessary to read the footer twice if there are filters. We should read the schema from the footer metadata first to determine which filters can be pushed down. After that, we set the pushdown info (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L261) and read the filtered row groups with the filter configuration.

(Sorry for the late reply!) Hmm, why? When we read the footer the first time, the result already contains all the row groups. We just need to pass these to ParquetFileReader, which will apply the pushed-down filters to these row groups and return the filtered list. See here.
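In other words, a sketch of the point being made (the mayMatch helper is a hypothetical stand-in for parquet-mr's row-group filtering logic): filtering is an in-memory pass over row groups we already hold, not another footer read.

import scala.jdk.CollectionConverters._
import org.apache.parquet.filter2.compat.FilterCompat
import org.apache.parquet.hadoop.metadata.{BlockMetaData, ParquetMetadata}

// The first (and only) footer read already yielded every row group; applying
// the pushed-down filter just prunes that in-memory list.
def filterRowGroups(
    footer: ParquetMetadata,
    pushed: FilterCompat.Filter,
    mayMatch: (FilterCompat.Filter, BlockMetaData) => Boolean): Seq[BlockMetaData] =
  footer.getBlocks.asScala.toSeq.filter(block => mayMatch(pushed, block))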

@yabola (Contributor, Author) commented Mar 16, 2023

@sunchao please take a look, thank you

@sunchao (Member) left a comment

We also need to support the V2 data source, e.g., ParquetPartitionReaderFactory.

@@ -205,11 +212,21 @@ class ParquetFileFormat

       val sharedConf = broadcastedHadoopConf.value.value

-      lazy val footerFileMetaData =
-        ParquetFooterReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS).getFileMetaData
+      val fileRange = HadoopReadOptions.builder(sharedConf, split.getPath)
@sunchao (Member) commented

Can we add these in ParquetFooterReader? We may need to use a try-with-resources clause to make sure resources are properly closed.

We can just obtain the footer here, use it later for footerFileMetaData, and pass it to VectorizedParquetRecordReader.
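A sketch of what that could look like (Scala try/finally standing in for Java's try-with-resources; ParquetFileReader.open and getFooter are parquet-mr APIs, the rest is assumed wiring):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.HadoopReadOptions
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.metadata.ParquetMetadata
import org.apache.parquet.hadoop.util.HadoopInputFile

def readFooterInRange(
    conf: Configuration, filePath: Path, start: Long, length: Long): ParquetMetadata = {
  val options = HadoopReadOptions.builder(conf, filePath)
    .withRange(start, start + length) // keep only row groups overlapping this split
    .build()
  val reader = ParquetFileReader.open(HadoopInputFile.fromPath(filePath, conf), options)
  try reader.getFooter   // the footer (with row groups) is read exactly once
  finally reader.close() // always release the underlying input stream
}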

@yabola (Contributor, Author) commented

Yes, I previously created and passed a ParquetFileReader because I wanted to save one file.newStream() call inside it (when there is no filter pushdown), but that doesn't seem to matter much. I have changed it to pass the footer here.
Please take a look, thank you!

@@ -279,7 +301,7 @@ class ParquetFileFormat

         // Instead, we use FileScanRDD's task completion listener to close this iterator.
         val iter = new RecordReaderIterator(vectorizedReader)
         try {
-          vectorizedReader.initialize(split, hadoopAttemptContext)
+          vectorizedReader.initialize(split, hadoopAttemptContext, fileReader)
@sunchao (Member) commented

Can we pass the footer to initialize instead?

@yabola yabola marked this pull request as draft March 20, 2023 16:37
@yabola yabola marked this pull request as ready for review March 21, 2023 02:58
      // Read all the row groups in advance and filter the row groups later if there
      // are filters that need to be pushed down.
      ParquetFooterReader.readFooter(conf, split, ParquetFooterReader.WITH_ROW_GROUPS)
    } else {
@yabola (Contributor, Author) commented Mar 21, 2023

I think we can use

if (aggregation.isDefined || enableVectorizedReader) {
  ParquetFooterReader.readFooter(conf, split, ParquetFooterReader.WITH_ROW_GROUPS)
}

@sunchao (Member) commented

It looks like we can, since when aggregation pushdown is enabled, ParquetScan.isSplitable returns false and we always read all the row groups in the file, so NO_FILTER is the same as WITH_ROW_GROUPS.

@sunchao (Member) left a comment

Thanks @yabola! This looks much better now. I have a few more nits, and I think we are close to merging after that.

    } else {
      filter = HadoopReadOptions.builder(configuration, split.getPath())
          .withRange(split.getStart(), split.getStart() + split.getLength())
          .withCodecFactory(new ParquetCodecFactory(configuration, 0))
@sunchao (Member) commented

Hmm, is this required? We don't need the codec factory when reading the footer, do we?

@yabola (Contributor, Author) commented Mar 30, 2023

Thank you for the review. Yes, if we just read the footer, there is no need to bring the codec factory.

  public static final boolean WITH_ROW_GROUPS = false;

  /**
   * method to read parquet file footer
@sunchao (Member) commented

Nit: how about:

Reads footer for the input Parquet file 'split'. If 'skipRowGroup' is true, this will skip reading the Parquet row group metadata.

@yabola (Contributor, Author) commented

done

   * if false, read row groups according to the file range
   */
  public static ParquetMetadata readFooter(
      Configuration configuration,
@sunchao (Member) commented

Nit: 4-space indent.

@yabola (Contributor, Author) commented

done

  }

  public void initialize(
      InputSplit inputSplit,
@sunchao (Member) commented

Nit: 4-space indent.

@@ -181,6 +184,16 @@ public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptCont
     initializeInternal();
   }

+  @Override
+  public void initialize(
+      InputSplit inputSplit,
@sunchao (Member) commented

Nit: 4-space indent.

-      lazy val footerFileMetaData =
-        ParquetFooterReader.readFooter(sharedConf, filePath, SKIP_ROW_GROUPS).getFileMetaData
+      val fileFooter = if (enableVectorizedReader) {
+        // This can avoid reading the footer twice(currently only optimize for vectorized read).
@sunchao (Member) commented

Nit: space after "twice".

@sunchao (Member) commented

not addressed yet

@yabola (Contributor, Author) commented

I updated the comments, thank you.

   */
  public static ParquetMetadata readFooter(
      Configuration configuration,
      FileSplit split,
@sunchao (Member) commented

Instead of using FileSplit, can we just pass PartitionedFile here? FileSplit is from org.apache.hadoop.mapred, which is really outdated.

@yabola (Contributor, Author) commented

Done, I have changed it to pass PartitionedFile.


@sunchao (Member) left a comment

Looks good, thanks again @yabola.

I think this PR now depends on #40555. Once that is merged, I'll approve and merge this.


@yabola yabola changed the title [SPARK-42388][SQL] Avoid parquet footer reads twice when no filters in vectorized reader [SPARK-42388][SQL] Avoid parquet footer reads twice in vectorized reader Apr 1, 2023
@sunchao (Member) left a comment

LGTM except one nit

      Configuration configuration,
      PartitionedFile partitionedFile,
      boolean skipRowGroup) throws IOException {
    FileSplit split = new FileSplit(partitionedFile.toPath(), partitionedFile.start(),
@sunchao (Member) commented

nit: I think there is no need to use FileSplit and hence depend on org.apache.hadoop.mapred here. Instead we can do:

    long start = file.start();
    long length = file.length();
    Path filePath = new Path(new URI(file.filePath().toString()));

@yabola (Contributor, Author) commented Apr 16, 2023

Good idea, done

@sunchao sunchao closed this in 52c1068 Apr 16, 2023
@sunchao (Member) commented Apr 16, 2023

Merged to master, thanks!

@yabola (Contributor, Author) commented Apr 16, 2023

@sunchao Thank you for your detailed review!

@sadikovi (Contributor) commented
@yabola @sunchao Could you share any benchmark numbers for the second optimization, reading all row groups for each task? My concern is that it could be suboptimal: if you have, say, 100 row groups in a file and create 100 tasks, one per row group, each task reads the full footer with all of the row groups just to process a single row group.

@yabola (Contributor, Author) commented Apr 24, 2023

@sadikovi Thanks for your advice; I will run a benchmark for the scenario you describe later. My understanding is that we don't read all the row groups: the range filter is applied (each task has its own start and end in the FileSplit, which may cover multiple row groups).

filter = HadoopReadOptions.builder(configuration, file.toPath())
    .withRange(fileStart, fileStart + file.length())
    .build()
    .getMetadataFilter();

But it does not contain the filter-pushdown information (reading the block metadata seems inevitable here, because without reading it we cannot know which row groups are needed). I will follow up on your concern later.

@sunchao (Member) commented Apr 24, 2023

Yes, @yabola is correct: if we have 100 row groups in a file and 100 tasks to read them, each task is assigned only a range (e.g., a single row group) of the file to read, so it won't read metadata for all the row groups in the file.

@yabola (Contributor, Author) commented May 17, 2023

@sadikovi I have tested the scenario you described. The smaller the row group size (and hence the larger the footer), the more this PR benefits.
Environment: parquet.block.size=10240 (10 KB); file size 253.6 MB (each file has about 25,000 row groups); 24 files written.
Before this PR:
[benchmark screenshot]

After this PR:
[benchmark screenshot]

szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
[SPARK-42388][SQL] Avoid parquet footer reads twice in vectorized reader (apache#1877)

Closes apache#39950 from yabola/skip_footer.

Authored-by: chenliang.lu <[email protected]>

Signed-off-by: Chao Sun <[email protected]>
Co-authored-by: chenliang.lu <[email protected]>
@dongjoon-hyun (Member) commented
Hi, @yabola and @sunchao .

SPARK-48950 seems to report a correctness issue related to this change. When you have some time, could you check it, please?

@yabola (Contributor, Author) commented Aug 1, 2024

@dongjoon-hyun I will look into this later.
