Parquet small file reading optimization #595
Conversation
build
@tgravescs about 75% through the review, posting what I have so far.
out.write(ParquetPartitionReader.PARQUET_MAGIC)
val allOutputBlocks = scala.collection.mutable.ArrayBuffer[BlockMetaData]()
filesAndBlocks.foreach { case (file, blocks) =>
  val in = file.getFileSystem(conf).open(file)
Nit: use withResource
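For context, a minimal sketch of the suggested pattern; this uses a simplified stand-in for the plugin's withResource helper rather than the real Arm trait:

```scala
// Simplified stand-in for the plugin's withResource: run the block and close
// the resource even if the block throws.
def withResource[T <: AutoCloseable, V](resource: T)(block: T => V): V = {
  try {
    block(resource)
  } finally {
    resource.close()
  }
}

// Applied to the quoted code, the opened stream is then guaranteed to close:
//   withResource(file.getFileSystem(conf).open(file)) { in =>
//     // copy the selected row groups from `in` into the output buffer
//   }
```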
hmm, might be a race condition in the test test_simple_partitioned_read_fail_legacy. I'll look into it; it passes locally.
I didn't get into great detail for all of the code. Will try to find more time to dig in deeper.
@@ -98,14 +103,16 @@ def test_pred_push_round_trip(spark_tmp_path, parquet_gen, read_func, v1_enabled
    rf = read_func(data_path)
    assert_gpu_and_cpu_are_equal_collect(
            lambda spark: rf(spark).select(f.col('a') >= s0),
            conf={'spark.sql.sources.useV1SourceList': v1_enabled_list})
            conf={'spark.rapids.sql.format.parquet.smallFiles.enabled': 'true',
shouldn't this be small_file_opt instead of 'true'
yes, I'll fix.
It looks like this requested change was missed?
that's weird, I fixed this; I must have somehow dropped it. I'll fix.
…inor things. Signed-off-by: Thomas Graves <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
ok, the main changes here are: we now pass a parameter into GpuParquetScan and GpuFileSourceScanExec indicating whether to enable the small file optimization, and we use that when checking for mergeSchema, input_file_name, and the like. We also look at the entire plan afterwards in GpuTransitionOverrides, and if we find input_file_name (or similar) being used, we replace the GpuParquetScanExec or GpuFileSourceScanExec node with one that has the small file optimization turned off.
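Roughly, that kind of plan check could look like the sketch below (illustrative only, not the PR's actual code; it assumes Spark's Catalyst input-file expressions and TreeNode.find):

```scala
import org.apache.spark.sql.catalyst.expressions.{InputFileBlockLength, InputFileBlockStart, InputFileName}
import org.apache.spark.sql.execution.SparkPlan

// Does any expression anywhere in the plan reference input_file_name(),
// input_file_block_start() or input_file_block_length()?
def planUsesInputFile(plan: SparkPlan): Boolean = {
  val usedHere = plan.expressions.exists { expr =>
    expr.find {
      case _: InputFileName | _: InputFileBlockStart | _: InputFileBlockLength => true
      case _ => false
    }.isDefined
  }
  usedHere || plan.children.exists(planUsesInputFile)
}
```

If that check is true for the plan containing the scan, the scan exec gets replaced with one that has the small file optimization disabled.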
build
failed to fetch spark 3.1.0 artifacts, rekicking
build
docs/configs.md
@@ -56,6 +56,7 @@ Name | Description | Default Value
<a name="sql.format.orc.write.enabled"></a>spark.rapids.sql.format.orc.write.enabled|When set to false disables orc output acceleration|true
<a name="sql.format.parquet.enabled"></a>spark.rapids.sql.format.parquet.enabled|When set to false disables all parquet input and output acceleration|true
<a name="sql.format.parquet.read.enabled"></a>spark.rapids.sql.format.parquet.read.enabled|When set to false disables parquet input acceleration|true
<a name="sql.format.parquet.smallFiles.enabled"></a>spark.rapids.sql.format.parquet.smallFiles.enabled|When set to true, handles reading multiple small files within a partition more efficiently by combining multiple files on the CPU side before sending to the GPU. Recommended unless user needs mergeSchema option or has files with mixed legacy date/timestamps (spark.sql.legacy.parquet.datetimeRebaseModeInRead)|true
Can we have a follow on issue to try and more cleanly handle schema evolution and datetimeRebaseMode?
actually I changed it so that it does handle mixed datetimeRebaseMode by splitting the files into separate batches if it finds that the modes on the files differ. I'll update the description here.
I'll file a follow-on to improve schema evolution.
  case _ => false
}
val canUseSmallFileOpt = (isParquet && conf.isParquetSmallFilesEnabled &&
  !(options.getOrElse("mergeSchema", "false").toBoolean ||
Just FYI schema evolution can happen even without mergeSchema. If the user just passes in their own schema you can hit the same thing.
thanks, filed follow-on issue #608
Signed-off-by: Thomas Graves <[email protected]>
build
val partitionValues = inPartitionValues.toSeq(partitionSchema)
val partitionScalars = ColumnarPartitionReaderWithPartitionValues
  .createPartitionValues(partitionValues, partitionSchema)
try {
Nit: withResource
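Same idea as the earlier nit, but for the array of partition scalars created above; a simplified sequence-closing variant (a stand-in, not necessarily the plugin's exact overload) could replace the try/finally:

```scala
// Simplified stand-in: close every element of the sequence once the block is
// done, even if the block throws.
def withResource[T <: AutoCloseable, V](resources: Seq[T])(block: Seq[T] => V): V = {
  try {
    block(resources)
  } finally {
    resources.foreach(_.close())
  }
}
```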
Signed-off-by: Thomas Graves <[email protected]>
build
* Initial prototype small filees parquet
* Change datasource v1 to use small files
* Working but has 72 bytes off in size
* Copy filesourcescan to databricks and fix merge error
* Fix databricks package name
* Try to debug size calculation - adds lots of warnings
* Cleanup and have file source scan small files only work for parquet
* Switch to use ArrayBuffer so order correct
* debug
* Fix order issue
* add more to calculated size
* cleanup
* Try to handle partition values
* fix passing partitionValues
* refactor
* disable mergeschema
* add check for mergeSchema
* Add tests for both small file optimization on and off
* hadnle input file - but doesn't totally work
* remove extra values reader
* Fixes
* Debug
* Check to see if Inputfile execs used
* Finding InputFileName works
* finding input file working
* cleanup and add tests for V2 datasource
* Add check for input file to GpuParquetScan
* Add more tests
* Add GPU metrics to GpuFileSourceScanExec Signed-off-by: Jason Lowe <[email protected]>
* remove log messages
* Docs
* cleanup
* Update 300db and 310 FileSourceScanExecs passing unit tests
* Add test for bucketing
* Add in logic for datetime corrected rebase mode
* Commonize some code
* Cleanup
* fixes
* Extract GpuFileSourceScanExec from shims Signed-off-by: Jason Lowe <[email protected]>
* Add more tests
* comments
* update test
* Pass metrics via GPU file format rather than custom options map Signed-off-by: Jason Lowe <[email protected]>
* working
* pass schema around properly
* fix value from tuple
* Rename case class
* Update tests
* Update code checking for DataSourceScanExec Signed-off-by: Jason Lowe <[email protected]>
* Fix scaladoc warning and unused imports Signed-off-by: Jason Lowe <[email protected]>
* Add realloc if over memory size
* refactor memory checks
* Fix copyright Signed-off-by: Jason Lowe <[email protected]>
* Upmerge to latest FileSourceScanExec changes for metrics
* Add missing check Filesource scan mergeSchema and cleanup
* Cleanup
* remove bucket test for now
* formatting
* Fixes
* Add more tests
* Merge conflict Signed-off-by: Thomas Graves <[email protected]>
* Fix merge conflict Signed-off-by: Thomas Graves <[email protected]>
* enable parquet bucket tests and change warning Signed-off-by: Thomas Graves <[email protected]>
* cleanup Signed-off-by: Thomas Graves <[email protected]>
* remove debug logs Signed-off-by: Thomas Graves <[email protected]>
* Move FilePartition creation to shim Signed-off-by: Thomas Graves <[email protected]>
* Add better message for mergeSchema Signed-off-by: Thomas Graves <[email protected]>
* Address review comments. Add in withResources and closeOnExcept and minor things. Signed-off-by: Thomas Graves <[email protected]>
* Fix spacing Signed-off-by: Thomas Graves <[email protected]>
* Fix databricks support and passing arguments Signed-off-by: Thomas Graves <[email protected]>
* fix typo in db Signed-off-by: Thomas Graves <[email protected]>
* Update config description Signed-off-by: Thomas Graves <[email protected]>
* Rework Signed-off-by: Thomas Graves <[email protected]>

Co-authored-by: Thomas Graves <[email protected]>
Co-authored-by: Jason Lowe <[email protected]>
Allow using run-in-docker when forked by a process without a tty Signed-off-by: Gera Shegalov <[email protected]>
closes #333
This PR adds an option to improve the performance of reading small files with the Parquet reader. The issue with the way Spark does the reading is that when a task is assigned multiple files to read, it just iterates over one file at a time. The plugin previously extended that behavior: for each file a task reads, we read it on the CPU side into a host memory buffer, acquire the semaphore, and then the GPU reads the Parquet data from the host memory buffer. This is inefficient because during that time the CPU isn't doing anything else and you end up with a lot of lock contention on the semaphore. This PR changes it so that we can read multiple small files on the CPU side into a host memory buffer, and once that buffer has reached a sufficient size we acquire the semaphore and have the GPU read it.
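As a rough illustration of the change in shape (not the PR's actual classes; HostChunk, readToHost, acquireGpu/releaseGpu, gpuDecode and targetBatchBytes are hypothetical placeholders):

```scala
// Illustrative sketch only: buffer several small files on the CPU first, then
// make one GPU pass per combined batch instead of one pass per file.
case class HostChunk(path: String, bytes: Array[Byte])

def readSmallFilesBatched(
    files: Seq[String],
    targetBatchBytes: Long,
    readToHost: String => HostChunk,
    acquireGpu: () => Unit,
    releaseGpu: () => Unit,
    gpuDecode: Seq[HostChunk] => Unit): Unit = {
  val pending = scala.collection.mutable.ArrayBuffer[HostChunk]()

  def flush(): Unit = if (pending.nonEmpty) {
    acquireGpu()                  // only now do we contend for the GPU semaphore
    try gpuDecode(pending.toSeq)  // one GPU decode covers many small files
    finally releaseGpu()
    pending.clear()
  }

  files.foreach { f =>
    pending += readToHost(f)      // CPU-side read into host memory, GPU not held
    if (pending.iterator.map(_.bytes.length.toLong).sum >= targetBatchBytes) {
      flush()
    }
  }
  flush()                         // decode whatever is left at the end
}
```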
This adds an option to turn the optimization on and off; it defaults to on: spark.rapids.sql.format.parquet.smallFiles.enabled
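For example, assuming an active SparkSession named spark, it can be turned off like this:

```scala
// The config key comes from this PR; disabling it falls back to the
// one-file-at-a-time reader.
spark.conf.set("spark.rapids.sql.format.parquet.smallFiles.enabled", "false")
```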
One issue I ran into was that the footer size we estimated for the host memory buffer was no longer correct. From looking into it, there are two offsets that differ with the new optimization. Those values are now larger than they were before because we are combining files, so estimating the footer size based on the original footers is too small. To handle this I added code to estimate the worst case. It also checks whether we are going to exceed the host memory buffer size even with the updated estimate; if so, we allocate a new buffer and copy the data into it before writing the footer. As a final check, if we write past the size of the host memory buffer we throw an exception.
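The buffer-growth part amounts to something like this sketch, using plain byte arrays as a stand-in for the real host memory buffers:

```scala
// Sketch only: grow the buffer if the worst-case footer estimate no longer
// fits, copying what has been written so far into the larger buffer.
def ensureRoomForFooter(
    buffer: Array[Byte],
    bytesWritten: Int,
    estimatedFooterSize: Int): Array[Byte] = {
  val needed = bytesWritten + estimatedFooterSize
  if (needed <= buffer.length) {
    buffer
  } else {
    val bigger = new Array[Byte](needed)
    System.arraycopy(buffer, 0, bigger, 0, bytesWritten)
    bigger
  }
}
```

The final safety check described above then happens after the footer is actually written: if the bytes written ever exceed the buffer size, an exception is thrown.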
I added tests for this as well as some more tests for things we weren't covering, like bucketing. The bucketing test found another incompatibility with Databricks around FilePartition, so I had to move that into a shim. Also note that for the bucketing test I had to enable Hive in the tests, so you now need a Spark version built with Hive support to run them.
Currently, with the small file optimization on, we don't support the mergeSchema option; it falls back to the CPU. Really I think I could instead just turn off the small file optimization in that case, so maybe I'll file a followup for that.
I also added code that makes a pass over the final plan to see if the user is asking for input_file_name, input_file_block_start, or input_file_block_length. If they are, we can't use the small file optimization because we are reading more than one file at a time, so that API doesn't make sense.
I also check for the legacy Parquet datetime rebase mode. In this case, if a task is trying to read files with different modes it throws an exception. We may be able to improve this by splitting those files into separate batches, but it needs more investigation.
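The batch-splitting idea mentioned here would boil down to grouping the task's files by mode before combining them, roughly as below (fileRebaseMode is a hypothetical helper that would read the mode from each file's footer metadata):

```scala
// Sketch: group files by their datetime rebase mode so each combined batch
// only contains files that agree, instead of throwing when they differ.
def splitByRebaseMode(
    files: Seq[String],
    fileRebaseMode: String => String): Seq[Seq[String]] =
  files.groupBy(fileRebaseMode).values.toSeq
```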
One example of the performance improvement I see with this is a query that ran over about 50,000 small files: with the small file optimization on it took 12 minutes, and with it off it took 27 minutes.
I do plan on getting some traces to see if there are other areas of improvement here, but I think this version gives us a good start.