
chore: [comet-parquet-exec] merge from main 20240116 #1299

Conversation

parthchandra (Contributor):

Merge from main

NoeB and others added 30 commits November 13, 2024 16:57
* feat: support array_append

* formatted code

* rewrite array_append plan to match spark behaviour and fixed bug in QueryPlan serde

* remove unwrap

* Fix for Spark 3.3

* refactor array_append binary expression serde code

* Disabled array_append test for spark 4.0+
apache#1062)

* Require offHeap memory

* remove unused import

* use off heap memory in stability tests

* reorder imports
* Update version number for build

* update docs
* update TPC-H results

* update Maven links

* update benchmarking guide and add TPC-DS results

* include q72
## Which issue does this PR close?

Closes apache#1067

## Rationale for this change

Bug fix: a few expressions were failing some unsigned-type-related tests.

## What changes are included in this PR?

 - For `u8`/`u16`, switched to use `generate_cast_to_signed!` in order to copy full i16/i32 width instead of padding zeros in the higher bits
 - `u64` becomes `Decimal(20, 0)` but there was a bug in `round()`  (`>` vs `>=`)

## How are these changes tested?

Put back tests for unsigned types
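The `round()` bug mentioned above (`>` vs `>=`) is the classic half-up rounding off-by-one: using `>` silently truncates values whose remainder is exactly half the scale divisor. This is a minimal illustrative sketch, not Comet's actual code; the function name and signature are hypothetical.

```rust
// Hypothetical sketch of half-up rounding on an unscaled integer value.
// Using `>` instead of `>=` in the comparison would mis-round values whose
// fractional part is exactly one half (e.g. 2.5 would become 2, not 3).
fn round_half_up(value: i128, divisor: i128) -> i128 {
    let quotient = value / divisor; // truncates toward zero
    let remainder = (value % divisor).abs();
    // Correct: `>=` so an exact half rounds away from zero.
    if remainder * 2 >= divisor {
        quotient + value.signum()
    } else {
        quotient
    }
}
```

For example, with a divisor of 10, an unscaled value of 25 (i.e. 2.5) rounds to 3 and -25 rounds to -3; with `>` in place of `>=`, both exact halves would have truncated to 2 and -2.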
* include first batch in ScanExec metrics

* record row count metric

* fix regression
* Add native metrics for plan creation

* make messages consistent

* Include get_next_batch cost in metrics

* formatting

* fix double count of rows
* Part of the implementation of array_insert

* Missing methods

* Working version

* Reformat code

* Fix code-style

* Add comments about spark's implementation.

* Implement negative indices

+ fix tests for spark < 3.4

* Fix code-style

* Fix scalastyle

* Fix tests for spark < 3.4

* Fixes & tests

- added test for the negative index
- added test for the legacy spark mode

* Use assume(isSpark34Plus) in tests

* Test else-branch & improve coverage

* Update native/spark-expr/src/list.rs

Co-authored-by: Andy Grove <[email protected]>

* Fix fallback test

In one case there is a zero index and the test fails due to a Spark error

* Adjust the behaviour for the NULL case to Spark

* Move the logic of type checking to the method

* Fix code-style

---------

Co-authored-by: Andy Grove <[email protected]>
…apache#1086)

* enable decimal to decimal cast of different precision and scale

* add more test cases for negative scale and higher precision

* add check for compatibility for decimal to decimal

* fix code style

* Update spark/src/main/scala/org/apache/comet/expressions/CometCast.scala

Co-authored-by: Andy Grove <[email protected]>

* fix the nit in comment

---------

Co-authored-by: himadripal <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
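Casting between decimals of different precision and scale, as enabled above, boils down to rescaling the unscaled integer by a power of ten and then checking that the result still fits the target precision. This is a hedged sketch under assumed semantics (truncation when scaling down, `None` on overflow), not the actual Comet/DataFusion implementation:

```rust
// Hypothetical sketch: cast a decimal's unscaled i128 value from scale s1
// to (precision p2, scale s2). Returns None when the value cannot be
// represented, which a real implementation might map to null or an error.
fn cast_decimal(unscaled: i128, s1: u32, p2: u32, s2: u32) -> Option<i128> {
    let rescaled = if s2 >= s1 {
        // Scaling up: multiply by 10^(s2 - s1), watching for overflow.
        unscaled.checked_mul(10i128.checked_pow(s2 - s1)?)?
    } else {
        // Scaling down loses fractional digits; this sketch truncates
        // toward zero, whereas a real cast would round first.
        unscaled / 10i128.pow(s1 - s2)
    };
    // The result must fit within p2 decimal digits.
    let max = 10i128.checked_pow(p2)? - 1;
    if rescaled.abs() <= max { Some(rescaled) } else { None }
}
```

For instance, casting 123.45 (unscaled 12345, scale 2) to Decimal(7, 4) yields unscaled 1234500, while casting it to Decimal(3, 2) overflows the 3-digit precision and fails.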
* fix: Use RDD partition index

* fix

* fix

* fix
…pache#1129)

* Use exact class comparison for parquet scan

* Add test

* Add comment
* fix metrics issues

* clippy

* update tests
…iew (apache#1119)

* Add more technical detail and new diagram to Comet plugin overview

* update diagram

* add info on Arrow IPC

* update diagram

* update diagram

* update docs

* address feedback
* save

* remove shuffle jvm metric and update tuning guide

* docs

* add source for all ScanExecs

* address feedback

* address feedback
* Remove unused StringView struct

* remove more dead code
* add some notes on shuffle

* reads

* improve docs
## Which issue does this PR close?

Part of apache#372 and apache#551

## Rationale for this change

To be ready for Spark 4.0

## What changes are included in this PR?

This PR enables more Spark 4.0 tests that were fixed by recent changes

## How are these changes tested?

tests enabled
* Refactor cast to use SparkCastOptions param

* update tests

* update benches

* update benches

* update benches
…che#1152)

* move aggregate expressions to spark-expr crate

* move more expressions

* move benchmark

* normalize_nan

* bitwise not

* comet scalar funcs

* update bench imports
rluvaton and others added 20 commits January 7, 2025 23:10
…ark grouping (apache#1218)

* extract predicate_functions expressions to folders based on spark grouping

* code review changes

---------

Co-authored-by: Andy Grove <[email protected]>
## Which issue does this PR close?

## Rationale for this change

Because `isCometShuffleEnabled` is false by default, some tests were not reached

## What changes are included in this PR?

Removed `isCometShuffleEnabled` and updated spark test diff

## How are these changes tested?

existing test
…ing (apache#1221)

* extract hash_funcs expressions to folders based on spark grouping

* extract hash_funcs expressions to folders based on spark grouping

---------

Co-authored-by: Andy Grove <[email protected]>
* wip: array remove

* added comet expression test

* updated test cases

* fixed array_remove function for null values

* removed commented code

* remove unnecessary code

* updated the test for 'array_remove'

* added test for array_remove in case the input array is null

* wip: case array is empty

* removed test case for empty array
* fall back to Spark for distinct aggregates

* update expected plans for 3.4

* update expected plans for 3.5

* force build

* add comment
…formance (apache#1190)

* Implement faster encoder for shuffle blocks

* make code more concise

* enable fast encoding for columnar shuffle

* update benches

* test all int types

* test float

* remaining types

* add Snappy and Zstd(6) back to benchmark

* fix regression

* Update native/core/src/execution/shuffle/codec.rs

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* address feedback

* support nullable flag

---------

Co-authored-by: Liang-Chi Hsieh <[email protected]>
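The idea behind the faster shuffle-block encoder above is to skip general-purpose serialization for fixed-width columns and copy raw little-endian values after a small header. The following is a deliberately simplified sketch of that technique; the function names and exact wire format are illustrative, not Comet's actual codec:

```rust
// Hypothetical minimal encoder for a fixed-width i32 column:
// a u32 row-count header followed by raw little-endian values.
fn encode_i32_column(values: &[i32], out: &mut Vec<u8>) {
    out.extend_from_slice(&(values.len() as u32).to_le_bytes());
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
}

// Inverse of encode_i32_column: read the count, then each value.
fn decode_i32_column(buf: &[u8]) -> Vec<i32> {
    let n = u32::from_le_bytes(buf[0..4].try_into().unwrap()) as usize;
    (0..n)
        .map(|i| {
            let start = 4 + i * 4;
            i32::from_le_bytes(buf[start..start + 4].try_into().unwrap())
        })
        .collect()
}
```

Because the encoded form is just a memcpy-friendly byte run, it can be cheaper than routing every row through a generic serializer, and optional compression (Snappy, Zstd) can still be layered on top, as the benchmarks in this PR compare.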
* fix: disable initCap by default

* Update spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala

Co-authored-by: Andy Grove <[email protected]>

* address review comments

---------

Co-authored-by: Andy Grove <[email protected]>
* Add changelog

* revert accidental change

* move 2 items to performance section
* fix: cast timestamp to decimal is unsupported

* fix style

* revert test name and mark as ignore

* add comment
* start 0.6.0 development

* update some docs

* Revert a change

* update CI
* fix links and provide complete scripts

* fix path

* fix incorrect text
parthchandra changed the title from "Comet parquet exec merge 20240116" to "chore: Comet parquet exec merge 20240116" on Jan 17, 2025
andygrove (Member) left a comment:


Thanks @parthchandra

andygrove changed the title from "chore: Comet parquet exec merge 20240116" to "chore: [comet-parquet-exec] merge from main 20240116" on Jan 17, 2025
@@ -190,7 +190,7 @@ class CometSparkSessionExtensions

          // data source V1
          case scanExec @ FileSourceScanExec(
-               HadoopFsRelation(_, partitionSchema, _, _, _: ParquetFileFormat, _),
+               HadoopFsRelation(_, partitionSchema, _, _, _, _),

This doesn't seem correct, because it would make us replace non-parquet scans as well.

main branch has:

          // data source V1
          case scanExec @ FileSourceScanExec(
                HadoopFsRelation(_, partitionSchema, _, _, fileFormat, _),
                _: Seq[_],
                requiredSchema,
                _,
                _,
                _,
                _,
                _,
                _)
              if CometScanExec.isFileFormatSupported(fileFormat)
                && CometScanExec.isSchemaSupported(requiredSchema)
                && CometScanExec.isSchemaSupported(partitionSchema) =>

parthchandra (Contributor, Author) replied:

Oh, right

parthchandra (Contributor, Author):

@andygrove I added another fix since you approved. This was also a merge issue and caused 6 test failures if native_datafusion was enabled.
Latest count:

native_comet: Tests: succeeded 797, failed 0, canceled 2, ignored 50, pending 0
native_datafusion: Tests: succeeded 628, failed 40, canceled 2, ignored 50, pending 0
native_iceberg_compat: Tests: succeeded 623, failed 45, canceled 2, ignored 50, pending 0

@andygrove andygrove merged commit c17a0f6 into apache:comet-parquet-exec Jan 18, 2025
74 checks passed